Abstract: The last decade has witnessed a tremendous growth 
in the volume as well as the diversity of multimedia content 
generated by a multitude of sources (news agencies, social media, 
etc.). Faced with a variety of content choices, consumers are 
exhibiting diverse preferences for content; their preferences often 
depend on the context in which they consume content as well 
as various exogenous events. To satisfy the consumers’ demand 
for such diverse content, multimedia content aggregators (CAs) 
have emerged which gather content from numerous multimedia 
sources. A key challenge for such systems is to accurately predict 
what type of content each of its consumers prefers in a certain 
context, and adapt these predictions to the evolving consumers’ 
preferences, contexts and content characteristics. We propose a 
novel, distributed, online multimedia content aggregation framework, 
which gathers content generated by multiple heterogeneous 
producers to fulfill its consumers’ demand for content. Since 
both the multimedia content characteristics and the consumers’ 
preferences and contexts are unknown, the optimal content 
aggregation strategy is unknown a priori. Our proposed content 
aggregation algorithm is able to learn online what content 
to gather and how to match content and users by exploiting 
similarities between consumer types. We prove bounds for our 
proposed learning algorithms that guarantee both the accuracy 
of the predictions as well as the learning speed. Importantly, 
our algorithms operate efficiently even when feedback from 
consumers is missing or content and preferences evolve over time. 
Illustrative results highlight the merits of the proposed content 
aggregation system in a variety of settings. 

Index Terms: Social multimedia, distributed online learning, 
content aggregation, multi-armed bandits. 

I. Introduction 

A plethora of multimedia applications (web-based TV [?], 
personalized video retrieval [?], personalized news aggregation [?], 
etc.) are emerging which require matching multimedia 
content generated by distributed sources with consumers 
exhibiting different interests. The matching is often performed 
by CAs (e.g., Dailymotion, Metacafe [?]) that are responsible 
for mining the content of numerous multimedia sources in 
search of content that is interesting for the users. 
Both the characteristics of the content and the preferences of the 
consumers evolve over time. An example of the system 
with users, CAs and multimedia sources is given in Fig. 1. 

Fig. 1. Operation of the distributed content aggregation system. (i) A 
user with type/context $x_i(t)$ arrives to Content Aggregator (CA) $i$; (ii) CA 
$i$ chooses a matching action (either requests content from another CA or 
requests content from a multimedia source in its own network). 

Each user is characterized by its context, which is a real-valued 
vector that provides information about the user’s 
content preferences. We assume a model where users arrive 
sequentially to a CA, and based on the type (context) of the 
user, the CA requests content from either one of the multimedia 
sources that it is connected to or from another CA that it 
is connected to. The context can represent information such 
as age, gender, search query, previously consumed content, 
etc. It may also represent the type of the device that the user 
is using [?] (e.g., PDA, PC, mobile phone). The CA’s role 
is to match its user with the most suitable content, which 
can be accomplished by requesting content from the most 
suitable multimedia source.^1 Since both the content generated 
by the multimedia sources and the user’s characteristics change 
over time, it is unknown to the CA which multimedia source 
to match with the user. This problem can be formulated 
as an online learning problem, where the CA learns the 
best matching by exploring matchings of users with different 
content providers. After a particular content matching is made, 
the user “consumes” the content, and provides feedback/rating, 
such as like or dislike.^2 It is this feedback that helps a CA 
learn the preferences of its users and the characteristics of the 
content that is provided by the multimedia sources. Since this 
is a learning problem, we equivalently call a CA a content 
learner or, simply, a learner. 

^1 Although we use the term request to explain how content from a multimedia 
source is mined, our proposed method also works when a CA extracts the 
content from the multimedia source, without any decision making performed 
by the multimedia source. 

^2 Our framework also works when the feedback is missing for some users. 

Two possible real-world applications of content aggregation 
are business news aggregation and music aggregation. Business 
news aggregators can collect information from a variety 
of multinational and multilingual sources and make personalized 
recommendations to specific individuals/companies 
based on their unique needs (see e.g. [?]). Music aggregators 
enable matching listeners with music content they enjoy 
both within the content network of the listeners as well as 
outside this network. For instance, distributed music aggregators 
can facilitate the sharing of music collections owned 
by diverse users without the need for centralized content 
managers/moderators/providers (see e.g. [?]). A discussion of 
how these applications can be modeled using our framework 
is given in Section III. Moreover, our proposed methods are 
tested on real-world datasets related to news aggregation and 
music aggregation in Section VII. 

For each CA i, there are two types of users: direct and 
indirect. Direct users are the users that visit the website of 
CA i to search for content. Indirect users are the users of 
another CA that requests content from CA i. A CA’s goal 
is to maximize the number of likes received from its users 
(both direct and indirect). This objective can be achieved by 
all CAs by the following distributed learning method: each CA 
learns online which matching action to take for its current user, 
i.e., obtain content from a multimedia source that is directly 
connected, or request content from another CA. However, 
it is not trivial how to use the past information collected 
by the CAs in an efficient way, due to the vast number of 
contexts (different user types) and dynamically changing user 
and content characteristics. For instance, a certain type of 
content may become popular among users at a certain point 
in time, which will require the CA to obtain content from the 
multimedia source that generates that type of content. 

To jointly optimize the performance of the multimedia 
content aggregation system, we propose an online learning 
methodology that builds on contextual bandits [10], [11]. The 
performance of the proposed methodology is evaluated using 
the notion of regret: the difference between the expected total 
reward (number of content likes minus costs of obtaining the 
content) of the best content matching strategy given complete 
knowledge about the user preferences and content characteristics 
and the expected total reward of the algorithm used by 
the CAs. When the user preferences and content characteristics 
are static, our proposed algorithms achieve sublinear regret in 
the number of users that have arrived to the system.^3 When 
the user preferences and content characteristics are slowly 
changing over time, our proposed algorithms achieve $\epsilon$ 
time-averaged regret, where $\epsilon > 0$ depends on the rate of change 
of the user and content characteristics. 

The remainder of the paper is organized as follows. In 
Section II, we describe the related work and highlight the 
differences from our work. In Section III, we describe the 
decentralized content aggregation problem, the optimal content 
matching scheme given the complete system model, and the 
regret of a learning algorithm with respect to the optimal 
content matching scheme. Then, we consider the model with 
unknown, static user preferences and content characteristics 
and propose a distributed online learning algorithm in Section 
IV. The analysis of the unknown, dynamic user preferences 
and content characteristics is given in Section VI. Using real-world 
datasets, we provide numerical results on the performance 
of our distributed online learning algorithms in Section 
VII. Finally, the concluding remarks are given in Section VIII. 

^3 We use index t to denote the number of users that have arrived so far. We 
also call t the time index, and assume that one user arrives at each time step. 


II. Related Work 

Related work falls into two categories: related work on 
recommender systems and related work on online learning 
methods called multi-armed bandits. 

A. Related Work on Recommender Systems and Content Matching 

A recommender system recommends items to its users based 
on the characteristics of the users and the items. The goal 
of a recommender system is to learn which users like which 
items, and recommend items such that the number of likes is 
maximized. For instance, [?], [?] provide a recommender system 
that learns the preferences of its users in an online way 
based on the ratings submitted by the users. It 
is assumed that the true relevance score of an item for a 
user is a linear function of the context of the user and the 
features of the item. Under this assumption, an online learning 
algorithm is proposed. In contrast, we consider a different 
model, where the relevance score need not be linear in the 
context. Moreover, due to the distributed nature of the problem 
we consider, our online learning algorithms need an additional 
phase called the training phase, which accounts for the fact 
that the CAs are uncertain about the information of the other 
aggregators that they are linked with. We focus on the long run 
performance and show that the regret per unit time approaches 
zero when the user and content characteristics are static. 
An online learning algorithm for a centralized recommender 
which updates its recommendations as both the preferences of 
the users and the characteristics of items change over time is 
proposed in [13]. 

The general framework which exploits the similarities between 
the past users and the current user to recommend content 
to the current user is called collaborative filtering [?]. 
These methods find the similarities between the current user 
and the past users by examining their search and feedback 
patterns, and then, based on the interactions with the past 
similar users, match the user with the content that has 
the highest estimated relevance score. For example, the most 
relevant content can be the content that is liked the highest 
number of times by similar users. Groups of similar users can 
be created by various methods such as clustering [15], and 
then, the matching will be made based on the content matched 
with the past users that are in the same group. 

TABLE I 
Comparison of our work with other work in recommender systems 

                        Our work   [5], [12]   [15]   [14]   [19] 
Distributed             Yes        No          No     No     No 
Reward model            Hölder     Linear      N/A    N/A    N/A 
Confidence bounds       Yes        No          No     No     No 
Regret bound            Yes        Yes         No     No     No 
Dynamic user/content    Yes        No          Yes    Yes    Yes 
distribution 

The most striking difference between our content matching 
system and previously proposed systems is that in prior works, there is 
a central CA which knows the entire set of different types of 
content, and all the users arrive to this central CA. In contrast, 
we consider a decentralized system consisting of many CAs, 
many multimedia sources that these CAs are connected to, 
and heterogeneous user arrivals to these CAs. These CAs are 
cooperating with each other by only knowing the connections 
with their own neighbors but not the entire network topology. 
Hence, a CA does not know which multimedia sources another 
CA is connected to, but it learns over time whether that CA has 
access to content that the users like or not. Thus, our model 
can be viewed as a giant collection of individual CAs that are 
running in parallel. 

Another line of work [?], [?] uses social streams mined 
in one domain, e.g., Twitter, to build a topic space that 
relates these streams to content in the multimedia domain. 
For example, in [?], Tweet streams are used to provide video 
recommendations in a commercial video search engine. A 
content adaptation method is proposed in [?] which enables 
the users with different types of contexts and devices to receive 
content that is in a suitable format to be accessed. Video 
popularity prediction is studied in [?], where the goal is 
to predict if a video will become popular in the multimedia 
domain, by detecting social trends in another social media domain 
(such as Twitter), and transferring this knowledge to the 
multimedia domain. Although these methods are very different 
from our methods, the idea of transferring knowledge from 
one multimedia domain to another can be carried out by CAs 
specialized in specific types of cross-domain content matching. 
For instance, one CA may transfer knowledge from tweets to 
predict the content which will have a high relevance/popularity 
for a user with a particular context, while another CA may scan 
through the Facebook posts of the user’s friends to calculate 
the context of the domain in addition to the context of the 
user, and provide a matching according to this. 

The advantages of our proposed approach over prior work 
in recommender systems are: (i) systematic analysis of the 
recommendations’ performance, including confidence bounds on 
the accuracy of the recommendations; (ii) no need for a priori 
knowledge of the users’ preferences (i.e., the system learns 
on-the-fly); (iii) high accuracy even when the users’ 
characteristics and content characteristics are changing over 
time; (iv) all these features are enabled in a network of 
distributed CAs. 

The differences of our work from the prior work in recommender 
systems are summarized in Table I. 

B. Related Work on Multi-armed Bandits 

Other than distributed content recommendation, our learning 
framework can be applied to any problem that can be formulated 
as a decentralized contextual bandit problem. Contextual 
bandits have been studied before in [?], [?], [?]-[?] in a 
single agent setting, where the agent sequentially chooses from 
a set of alternatives with unknown rewards, and the rewards 
depend on the context information provided to the agent at 
each time step. In [?], a contextual bandit algorithm named 
LinUCB is proposed for recommending personalized news 
articles, which is a variant of the UCB algorithm designed 
for linear payoffs. Numerical results on real-world Internet 
data are provided, but no theoretical results on the resulting 
regret are derived. The main differences of our work from single 
agent contextual bandits are that: (i) a three-phase learning 
algorithm with training, exploration and exploitation phases is 
needed instead of the standard two-phase algorithms (exploration 
and exploitation phases only) used in centralized contextual 
bandit problems; (ii) the adaptive partitions of the context 
space should be formed in a way that each learner/aggregator 
can efficiently utilize what is learned by other learners about 
the same context; (iii) the algorithm is robust to missing 
feedback (some users do not rate the content). 

III. Problem Formulation 

The system model is shown in Fig. 1. There are $M$ 
content aggregators (CAs) which are indexed by the set 
$\mathcal{M} := \{1, 2, \ldots, M\}$. We also call each CA a learner since it 
needs to learn which type of content to provide to its users. Let 
$\mathcal{M}_{-i} := \mathcal{M} - \{i\}$ be the set of CAs that CA $i$ can choose from 
to request content. Each CA has access to the contents over 
its content network as shown in Fig. 1. The set of contents 
in CA $i$’s content network is denoted by $\mathcal{C}_i$. The set of all 
contents is denoted by $\mathcal{C} := \cup_{i \in \mathcal{M}} \mathcal{C}_i$. The system works in 
a discrete time setting $t = 1, 2, \ldots, T$, where the following 
events happen sequentially, in each time slot: (i) a user with 
context $x_i(t)$ arrives to each CA $i \in \mathcal{M}$,^4 (ii) based on the 
context of its user each CA matches its user with a content 
(either from its own content network or by requesting content 
from another CA), (iii) the user provides a feedback, denoted 
by $y_i(t)$, which is either like ($y_i(t) = 1$) or dislike ($y_i(t) = 0$). 
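
The sequence of events (i)-(iii) in one time slot can be sketched for a single CA as follows. This is only an illustration of the interaction protocol, not a learning algorithm: the names `run_time_slot`, `choose_action` and `like_probability` are hypothetical, and the like probability is assumed to be available to the simulator (it is unknown to the CA).

```python
import random


def run_time_slot(context, actions, choose_action, like_probability, rng):
    """Simulate one time slot for a single CA.

    context          -- the arriving user's context, step (i)
    actions          -- the matching actions available to the CA
    choose_action    -- hypothetical policy: (context, actions) -> action
    like_probability -- simulator-side function: (action, context) -> [0, 1]
    rng              -- a random.Random instance
    Returns the chosen action and the feedback (1 = like, 0 = dislike).
    """
    # Step (ii): the CA picks a matching action based on the context.
    action = choose_action(context, actions)
    # Step (iii): the user likes the content with the given probability.
    feedback = 1 if rng.random() < like_probability(action, context) else 0
    return action, feedback
```

A learning algorithm would replace `choose_action` with a rule that depends on the feedback observed in earlier slots.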

The set of content matching actions of CA $i$ is denoted 
by $\mathcal{K}_i := \mathcal{C}_i \cup \mathcal{M}_{-i}$. Let $\mathcal{X} = [0,1]^d$ be the context space,^5 
where $d$ is the dimension of the context space. The context 
can include many properties of the user such as age, gender, 
income, previously liked content, etc. We assume that all these 
quantities are mapped into $[0,1]^d$. For instance, this mapping 
can be established by feature extraction methods such as the 
one given in [?]. Another method is to represent each property 
of a user by a real number in $[0,1]$ (e.g., normalize 
the age by a maximum possible age, represent gender by the set 
$\{0,1\}$, etc.), without feature extraction. The feedback set of a 
user is denoted by $\mathcal{Y} := \{0,1\}$. Let $C_{\max} := \max_{i \in \mathcal{M}} |\mathcal{C}_i|$. 
We assume that all CAs know $C_{\max}$ but they do not need to 
know the content networks of other CAs. 
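
The direct mapping without feature extraction can be sketched as below for a two-dimensional context. The function name `to_context`, the choice of the two properties, and the maximum age of 120 are assumptions made for illustration only.

```python
def to_context(age, gender, max_age=120.0):
    """Map raw user properties to a point in [0, 1]^2.

    age     -- age in years, normalized by an assumed maximum age
    gender  -- already encoded as 0 or 1
    max_age -- assumed normalization constant (hypothetical value)
    """
    if gender not in (0, 1):
        raise ValueError("gender must be encoded as 0 or 1")
    # Clip so that out-of-range ages still land inside [0, 1].
    age_norm = min(max(age / max_age, 0.0), 1.0)
    return (age_norm, float(gender))
```

For example, `to_context(30, 1)` gives the context `(0.25, 1.0)`.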

The following two examples demonstrate how business 
news aggregation and music aggregation fit our problem 
formulation. 

^4 Although in this model user arrivals are synchronous, our framework will 
work for asynchronous user arrivals as well. 

^5 In general, our results will hold for any bounded subspace of $\mathbb{R}^d$. 
Example 1: Business news aggregation. Consider a distributed 
set of news aggregators that operate in different 
countries (for instance a European news aggregator network as 
in [?]). Each news aggregator’s content network (as portrayed 
in Fig. 1) consists of content producers (multimedia 
sources) that are located in specific regions/countries. 
Consider a user with context $x$ (e.g. age, gender, nationality, 
profession) who subscribes to the CA A, which is located 
in the country where the user lives. This CA has access to 
content from local producers in that country but it can also 
request content from other CAs, located in different countries. 
Hence, a CA has access to (local) content generated in other 
countries. In such scenarios, our proposed system is able to 
recommend to the user subscribing to CA A also content 
from other CAs, by discovering the content that is most 
relevant to that user (based on its context $x$) across the entire 
network of CAs. For instance, for a user doing business in 
the transportation industry, our content aggregator system may 
learn to recommend road construction news, accidents or gas 
prices from particular regions that are on the route of the 
transportation network of the user. 

Example 2: Music aggregation. Consider a distributed set 
of music aggregators that are specialized in specific genres 
of music: classical, jazz, rock, rap, etc. Our proposed model 
allows music aggregators to share content to provide personalized 
recommendations for a specific user. For instance, a 
user that subscribes (frequents/listens) to the classical music 
aggregator may also like specific jazz tracks. Our proposed 
system is able to discover and recommend to that user also 
other music that it will enjoy in addition to the music available 
to/owned by the aggregator to which it subscribes. 

A. User and Content Characteristics 

In this paper we consider two types of user and content 
characteristics. First, we consider the case when the user and 
content characteristics are static, i.e., they do not change over 
time. For this case, for a user with context $x$, $\pi_c(x)$ denotes 
the probability that the user will like content $c$. We call this 
the relevance score of content $c$. 

The second case we consider corresponds to the scenario 
when the characteristics of the users and content are dynamic. 
For online multimedia content, especially for social media, it 
is known that both the user and the content characteristics 
are dynamic and noisy [24], hence the problem exhibits 
concept drift [25]. Formally, a concept is the distribution of 
the problem, i.e., the joint distribution of the user and content 
characteristics, at a certain point of time [26]. Concept drift is 
a change in this distribution. For the case with concept drift, 
we propose a learning algorithm that takes into account the 
speed of the drift to decide what window of past observations 
to use in estimating the relevance score. The proposed learning 
algorithm has theoretical performance guarantees, in contrast 
to prior work on concept drift which mainly deals with the 
problem in an ad hoc manner. Indeed, it is customary to assume 
that online content is highly dynamic. A certain type of content 
may become popular for a certain period of time, and then, 
its popularity may decrease over time and a new content may 
emerge as popular. In addition, although the type of the content 
remains the same, such as soccer news, its popularity may 
change over time due to exogenous events such as the World 
Cup etc. Similarly, a certain type of content may become 
popular for a certain type of demographics (e.g., users of a 
particular age, gender, profession, etc.). However, over time 
the interest of these users may shift to other types of content. 
In such cases, where the popularity of content changes over 
time for a user with context $x$, $\pi_c(x, t)$ denotes the probability 
that the user at time $t$ will like content $c$. 
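
The idea of estimating a drifting relevance score from a window of recent observations can be illustrated with a simple sliding-window average. This is a generic sketch and not the learning algorithm proposed in this paper: here the window length is a fixed, user-supplied parameter, whereas the proposed algorithm is described as choosing the window based on the speed of the drift. The class name `WindowedRelevance` is hypothetical.

```python
from collections import deque


class WindowedRelevance:
    """Estimate a like probability from the most recent `window` feedbacks."""

    def __init__(self, window):
        if window < 1:
            raise ValueError("window must be at least 1")
        # deque with maxlen silently discards the oldest feedback when full.
        self.feedback = deque(maxlen=window)

    def update(self, liked):
        """Record one feedback: truthy for like, falsy for dislike."""
        self.feedback.append(1 if liked else 0)

    def estimate(self):
        """Return the fraction of likes in the window, or None if empty."""
        if not self.feedback:
            return None
        return sum(self.feedback) / len(self.feedback)
```

A short window tracks changes quickly but averages over few samples, while a long window does the opposite; this is the tradeoff that the choice of window has to balance.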

As we stated earlier, a CA $i$ can either recommend content 
from multimedia sources that it is directly connected to or 
can ask another CA for content. By asking for content $c$ from 
another CA $j$, CA $i$ will incur cost $d^i_j > 0$. For the purpose 
of our paper, the cost is a generic term. For instance, it can 
be a payment made to CA $j$ to display it to CA $i$’s user, or 
it may be associated with the advertising loss CA $i$ incurs by 
directing its user to CA $j$’s website. When the cost is payment, 
it can be money, tokens [27] or Bitcoins [28]. Since this cost is 
bounded, without loss of generality we assume that $d^i_j \in [0,1]$ 
for all $i, j \in \mathcal{M}$. In order to make our model general, we also 
assume that there is a cost associated with recommending a 
type of content $c \in \mathcal{C}_i$, which is given by $d^i_c \in [0,1]$, for CA 
$i$. For instance, this can be a payment made to the multimedia 
source that owns content $c$. 

An intrinsic assumption we make is that the CAs are 
cooperative. That is, CA $j \in \mathcal{M}_{-i}$ will return the content 
that is most likely to be liked by CA $i$’s user when asked by CA 
$i$ to recommend a content. This cooperative structure can be 
justified as follows. Whenever a user likes the content of CA 
$j$ (either its own user or user of another CA), CA $j$ obtains a 
benefit. This can be either an additional payment made by CA 
$i$ when the content recommended by CA $j$ is liked by CA $i$’s 
user, or it can simply be the case that whenever a content of 
CA $j$ is liked by someone its popularity increases. However, 
we assume that the CAs’ decisions do not change their pool 
of users. The future user arrivals to the CAs are independent 
of their past content matching strategies. For instance, users 
of a CA may have monthly or yearly subscriptions, so they 
will not shift from one CA to another CA when they like the 
content of the other CA. 

The goal of CA $i$ is to explore the matching actions in $\mathcal{K}_i$ 
to learn the best content for each context, while at the same 
time exploiting the best content for the user with context $x_i(t)$ 
arriving at each time instance $t$ to maximize its total number 
of likes minus costs. CA $i$’s problem can be modeled as a 
contextual bandit problem [?], where likes 
and costs translate into rewards. In the next subsection, we 
formally define the benchmark solution which is computed 
using perfect knowledge about the probability that a content 
$c$ will be liked by a user with context $x$ (which requires 
complete knowledge of user and content characteristics). Then, 
we define the regret which is the performance loss due to 
uncertainty about the user and content characteristics. 

B. Optimal Content Matching with Complete Information 

Our benchmark when evaluating the performance of the 
learning algorithms is the optimal solution which always 
recommends the content with the highest relevance score minus 
cost for CA $i$ from the set $\mathcal{C}$ given context $x_i(t)$ at time $t$. This 
corresponds to selecting the best matching action in $\mathcal{K}_i$ given 
$x_i(t)$. Next, we define the expected rewards of the matching 
actions, and the action selection policy of the benchmark. For 
a matching action $k \in \mathcal{M}_{-i}$, its relevance score is given as 
$\pi_k(x) := \pi_{c^*_k(x)}(x)$, where $c^*_k(x) := \arg\max_{c \in \mathcal{C}_k} \pi_c(x)$. For 
a matching action $k \in \mathcal{C}_i$ its relevance score is equal to the 
relevance score of content $k$. The expected reward of CA $i$ 
from choosing action $k \in \mathcal{K}_i$ is given by the quasilinear utility 
function 

$$\mu^i_k(x) := \pi_k(x) - d^i_k \tag{1}$$

where $d^i_k \in [0,1]$ is the normalized cost of choosing action $k$ 
for CA $i$. Our proposed system will also work for more general 
expected reward functions as long as the expected reward of a 
learner is a function of the relevance score of the chosen action 
and the cost (payment, communication cost, etc.) associated 
with choosing that action. The oracle benchmark is given by 

$$k_i^*(x) := \arg\max_{k \in \mathcal{K}_i} \mu^i_k(x) \quad \forall x \in \mathcal{X}. \tag{2}$$

The oracle benchmark depends on relevance scores as well as 
costs of matching content from its own content network or 
other CAs’ content networks. The case $d^i_k = 0$ for all $k \in \mathcal{K}_i$ 
and $i \in \mathcal{M}$ corresponds to the scheme in which content 
matching has zero cost, hence $k_i^*(x) = \arg\max_{k \in \mathcal{K}_i} \pi_k(x) = 
\arg\max_{c \in \mathcal{C}} \pi_c(x)$. This corresponds to the best centralized 
solution, where CAs act as a single entity. On the other 
hand, when $d^i_k \geq 1$ for all $k \in \mathcal{M}_{-i}$ and $i \in \mathcal{M}$, in the 
oracle benchmark a CA must not cooperate with any other 
CA and should only use its own content. Hence $k_i^*(x) = 
\arg\max_{c \in \mathcal{C}_i} (\pi_c(x) - d^i_c)$. In the following subsections, we 
will show that independent of the values of relevance scores 
and costs, our algorithms will achieve sublinear regret (in the 
number of users or equivalently time) with respect to the oracle 
benchmark. 
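
As a minimal sketch of the oracle benchmark, assuming the relevance scores are given as a known function (which is exactly what the CAs do not have in practice), the best matching action can be computed by enumerating the actions of CA i. The function name `oracle_action`, the tagging of actions as `("content", c)` or `("ca", j)`, and the dictionary-based inputs are illustrative choices, and each neighbor's content list is assumed to be non-empty.

```python
def oracle_action(x, own_contents, neighbor_contents, relevance, cost):
    """Return (best action, expected reward) for context x.

    own_contents      -- content ids in the CA's own content network
    neighbor_contents -- dict: neighbor CA id -> list of its content ids
    relevance         -- function (content id, x) -> like probability
    cost              -- dict: action -> normalized cost in [0, 1]
    Actions are tagged tuples so content ids and CA ids cannot collide.
    """
    rewards = {}
    # Own content: relevance score of the content minus its cost.
    for c in own_contents:
        action = ("content", c)
        rewards[action] = relevance(c, x) - cost[action]
    # Another CA: that CA returns its most relevant content for x.
    for j, contents in neighbor_contents.items():
        action = ("ca", j)
        best_score = max(relevance(c, x) for c in contents)
        rewards[action] = best_score - cost[action]
    best = max(rewards, key=rewards.get)
    return best, rewards[best]
```

With a relevance score of 0.6 for an own content at zero cost and 0.9 for a neighbor's content, the oracle requests from the neighbor when the request cost is 0.1 but stays with its own content when the cost is 0.5.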


C. The Regret of Learning 


In this subsection we define the regret as a performance 
measure of the learning algorithm used by the CAs. Simply, 
the regret is the loss incurred due to the unknown system 
dynamics. Regret of a learning algorithm which selects the 
matching action/arm $a_i(t)$ at time $t$ for CA $i$ is defined with 
respect to the best matching action $k_i^*(x)$ given in (2). Then, 
the regret of CA $i$ at time $T$ is 

$$R_i(T) := \sum_{t=1}^{T} \left( \pi_{k_i^*(x_i(t))}(x_i(t)) - d^i_{k_i^*(x_i(t))} \right) - \mathrm{E}\left[ \sum_{t=1}^{T} \left( \mathrm{I}(y_i(t) = 1) - d^i_{a_i(t)} \right) \right] \tag{3}$$

where $\mathrm{I}(\cdot)$ is the indicator function. 
Regret gives the convergence rate of the total expected reward 
of the learning algorithm to the value of the optimal solution 
given in (2). Any algorithm whose regret is sublinear, i.e., 
$R_i(T) = O(T^\gamma)$ such that $\gamma < 1$, will converge to the optimal 
solution in terms of the average reward, since then 
$R_i(T)/T = O(T^{\gamma - 1}) \to 0$ as $T \to \infty$. 
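
In simulations where the relevance scores are known to the experimenter, a sample-path counterpart of the regret can be computed as sketched below. This is an assumption-laden illustration: the function name `empirical_regret` is hypothetical, and a single run replaces the expectation in the definition, so averaging over many runs would be needed to approximate the regret itself.

```python
def empirical_regret(oracle_rewards, feedback, costs):
    """Sample-path regret over T time slots.

    oracle_rewards -- expected reward of the oracle's action in each slot
    feedback       -- observed feedback per slot (1 = like, 0 = dislike)
    costs          -- cost of the action actually chosen in each slot
    """
    if not (len(oracle_rewards) == len(feedback) == len(costs)):
        raise ValueError("all sequences must have the same length T")
    # Realized reward of the learner: likes minus costs.
    realized = sum(y - d for y, d in zip(feedback, costs))
    return sum(oracle_rewards) - realized
```

Dividing the returned value by the number of slots gives the time-averaged regret, which tends to zero for an algorithm with sublinear regret.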

A summary of notations is given in Table II. In the next section, 
we propose an online learning algorithm which achieves 
sublinear regret when the user and content characteristics are 
static. 

TABLE II 
Notations used in problem formulation 

$\mathcal{M}$: Set of all CAs 
$\mathcal{C}_i$: Contents in the content network of CA $i$ 
$C_{\max}$: $\max_{i \in \mathcal{M}} |\mathcal{C}_i|$ 
$\mathcal{C}$: Set of all contents 
$\mathcal{X} = [0,1]^d$: Context space 
$\mathcal{Y}$: Set of feedbacks a user can give 
$x_i(t)$: $d$-dimensional context of the $t$th user of CA $i$ 
$y_i(t)$: Feedback of the $t$th user of CA $i$ 
$\mathcal{K}_i$: Set of content matching actions of CA $i$ 
$\pi_c(x)$: Relevance score of content $c$ for context $x$ 
$d^i_k$: Cost of choosing matching action $k$ for CA $i$ 
$\mu^i_k(x)$: Expected reward (static) of CA $i$ from matching action $k$ for context $x$ 
$k_i^*(x)$: Optimal matching action of CA $i$ given context $x$ (oracle benchmark) 
$R_i(T)$: Regret of CA $i$ at time $T$ 
$\beta_a := \sum_{t=1}^{\infty} 1/t^a$ 

IV. A Distributed Online Content Matching Algorithm 

In this section we propose an online learning algorithm for 
content matching when the user and content characteristics 
are static. In contrast to prior online learning algorithms that 
exploit the context information [10], [11], [?], 
which consider a single learner setting, the proposed algorithm 
helps a CA to learn from the experience of other CAs. With 
this mechanism, a CA is able to recommend content from 
multimedia sources to which it has no direct connection, without 
needing to know the IDs of such multimedia sources and their 
content. It learns about these multimedia sources only through 
the other CAs that it is connected to. 

In order to bound the regret of this algorithm analytically we 
use the following assumption. When the content characteristics 
are static, we assume that each type of content has similar 
relevance scores for similar contexts; we formalize this in 
terms of a Lipschitz condition. 

Assumption 1: There exists $L > 0$, $\gamma > 0$ such that for all 
$x, x' \in \mathcal{X}$ and $c \in \mathcal{C}$, we have $|\pi_c(x) - \pi_c(x')| \leq L \|x - x'\|^\gamma$. 

Assumption 1 indicates that the probability that a type $c$ 
content is liked by users with similar contexts will be similar to 
each other. For instance, if two users have similar age, gender, 
etc., then it is more likely that they like the same content. We 
call $L$ the similarity constant and $\gamma$ the similarity exponent. 
These parameters will depend on the characteristics of the 
users and the content. We assume that $\gamma$ is known by the CAs. 
However, an unknown $\gamma$ can be estimated online using the 
history of likes and dislikes by users with different contexts, 
and our proposed algorithms can be modified to include the 
estimation of $\gamma$. 
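
For a given relevance function, the similarity condition can be checked numerically on a pair of contexts, as in the sketch below. The function name `satisfies_similarity` is hypothetical, the Euclidean norm is an assumed choice of norm, and passing the check on sampled pairs is of course only evidence, not a proof, that the condition holds on the whole context space.

```python
import math


def satisfies_similarity(pi, x1, x2, L, gamma):
    """Check |pi(x1) - pi(x2)| <= L * ||x1 - x2||^gamma for one pair.

    pi     -- relevance function of one content: context -> [0, 1]
    x1, x2 -- contexts given as equal-length sequences of floats
    L      -- similarity constant
    gamma  -- similarity exponent
    """
    distance = math.dist(x1, x2)  # Euclidean distance (assumed norm)
    return abs(pi(x1) - pi(x2)) <= L * distance ** gamma
```

A relevance score that varies smoothly with the context passes the check for a suitable `L`, while one that jumps between two nearby contexts fails it.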

Fig. 2. Content matching within own content network. 

Fig. 3. Content matching from the network of another CA. 

In view of this assumption, the important question becomes 
how to learn from the past experience which content to 
match with the current user. We answer this question by 
proposing an algorithm which partitions the context space of 
a CA, and learns the relevance scores of different types of 
content for each set in the partition, based only on the past 
experience in that set. The algorithm is designed in a way to 
achieve the optimal tradeoff between the size of the partition and 
the past observations that can be used together to learn the 
relevance scores. It also includes a mechanism to help CAs 
learn from each other’s users. We call our proposed algorithm 
the Distributed COntent Matching algorithm (DISCOM), and 
its pseudocode is given in Fig. 4, Fig. 5 and Fig. 6. 
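
The idea of learning a relevance score separately for each set of the partition, using only the feedback observed in that set, can be sketched with a per-set sample mean. This is not DISCOM itself, which the text describes as having additional mechanisms for cooperation between CAs; the class name `SetEstimates` and the use of an arbitrary hashable identifier for each set are assumptions made for illustration.

```python
from collections import defaultdict


class SetEstimates:
    """Sample-mean relevance estimates kept per (partition set, content)."""

    def __init__(self):
        self.likes = defaultdict(int)   # number of likes per key
        self.counts = defaultdict(int)  # number of observations per key

    def update(self, set_id, content, liked):
        """Record one feedback for `content` from a user in `set_id`."""
        key = (set_id, content)
        self.counts[key] += 1
        self.likes[key] += 1 if liked else 0

    def estimate(self, set_id, content):
        """Return the fraction of likes, or None if nothing was observed."""
        key = (set_id, content)
        n = self.counts.get(key, 0)
        if n == 0:
            return None
        return self.likes.get(key, 0) / n
```

Feedback from one set never changes the estimate in another set, which is what makes the size of the sets matter: smaller sets group more similar users but collect fewer observations each.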

Each CA $i$ has two tasks: matching content with its own 
users and matching content with the users of other CAs when 
requested by those CAs. We call the first task the maximization 
task (implemented by DISCOMmax given in Fig. [?]), since the 
goal of CA $i$ is to maximize the number of likes from its 
own users. The second task is called the cooperation task 
(implemented by DISCOMcoop given in Fig. [?]), since the 
goal of CA $i$ is to help other CAs obtain content from its 
own content network in order to maximize the likes they 
receive from their users. This cooperation is beneficial to 
CA $i$ for several reasons. Firstly, since every CA 
cooperates, CA $i$ can reach a much larger set of content 
including the content from other CAs’ content networks, hence 
will be able to provide content with higher relevance score to 
its users. Secondly, when CA $i$ helps CA $j$, it will observe the 
feedback of CA $j$’s user for the matched content, hence will 
be able to update the estimated relevance score of its content, 
which is beneficial if a user similar to CA $j$’s user arrives 
to CA $i$ in the future. Thirdly, payment mechanisms can be 
incorporated to the system such that CA $i$ gets a payment from 
CA $j$ when its content is liked by CA $j$’s user. 

In summary, there are two types of content matching actions
for a user of CA i. In the first type, the content is recommended
from a source that is directly connected to CA i, while in
the second type, the content is recommended from a source
to which CA i is connected through another CA. The information
exchange between multimedia sources and CAs for these two
types of actions is shown in Fig. 2 and Fig. 3.

Let T be the time horizon of interest (equivalent to the
number of users that arrive to each CA). DISCOM creates a
partition of $\mathcal{X} = [0,1]^d$ based on T. For instance, T can be
the average number of visits to the CA's website in one day.
Although in reality the average number of visits can differ
across CAs, our regret analysis in this section still holds
because it is a worst-case analysis (it assumes that users
arrive only to CA i, while the other CAs learn only through
CA i's users). Moreover, the case of heterogeneous numbers of
visits can be addressed if each CA informs the other CAs about
its average number of visits. Then, CA i can keep M different
partitions of the context space: one for itself and M − 1 for
the other CAs. If called by CA j, it matches content to CA j's
user based on the partition it keeps for CA j. Hence, we focus
on the case when T is common to all CAs.

We first define $m_T$ as the slicing level used by DISCOM,
which is an integer that is used to partition $\mathcal{X}$. DISCOM forms
a partition of $\mathcal{X}$ consisting of $(m_T)^d$ sets (hypercubes), where
each set is a d-dimensional hypercube with edge length $1/m_T$.
This partition is denoted by $\mathcal{P}_T$. The hypercubes in $\mathcal{P}_T$ are
oriented such that one of them has a corner located at the
origin of the d-dimensional Euclidean space. The number of
hypercubes is increasing in $m_T$, while their size is
decreasing in $m_T$. When $m_T$ is small, each hypercube covers
a large set of contexts, hence the number of past observations
that can be used to estimate relevance scores of matching
actions in each set is large. However, the variation of the true
relevance scores of the contexts within a hypercube increases
with the size of the hypercube. DISCOM should set $m_T$ to a
value that balances this tradeoff.

A hypercube in $\mathcal{P}_T$ is denoted by p. The hypercube in $\mathcal{P}_T$
that contains context $x_i(t)$ is denoted by $p_i(t)$. When $x_i(t)$
is located at a boundary of multiple hypercubes in $\mathcal{P}_T$, it is
randomly assigned to one of these hypercubes.
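The mapping from a context to its hypercube can be sketched as follows. This is only an illustrative sketch: the function name `hypercube_index`, the tuple-of-integers indexing scheme and the per-coordinate coin flip used for boundary contexts are assumptions made here, not part of the original description.

```python
import math
import random


def hypercube_index(x, m_T, rng=random):
    """Return the index tuple of the hypercube (edge length 1/m_T) containing x.

    x is a context in [0, 1]^d. A coordinate that falls exactly on an interior
    boundary is assigned at random to one of the two adjacent cells, which
    mirrors the random boundary assignment described in the text.
    """
    index = []
    for coord in x:
        if not 0.0 <= coord <= 1.0:
            raise ValueError("context coordinates must lie in [0, 1]")
        scaled = coord * m_T
        # Clamp so that coord == 1.0 falls into the last cell.
        cell = min(int(math.floor(scaled)), m_T - 1)
        on_interior_boundary = scaled == math.floor(scaled) and 0 < scaled < m_T
        if on_interior_boundary and rng.random() < 0.5:
            cell -= 1
        index.append(cell)
    return tuple(index)
```

With `m_T = 4`, the context `(0.3, 0.9)` falls in cell `(1, 3)`; all counters and estimates of DISCOM would then be keyed by such an index.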

DISCOMmax operates as follows. CA i matches its user at
time t with a content by taking a matching action in one of
three phases: (i) the training phase, in which CA i requests
content from another CA j for the purpose of helping CA j
learn the relevance score of content in its content network
for users with context $x_i(t)$ (but CA i does not update its
relevance score estimate for CA j, because it thinks that CA j
may not yet know much about its own content); (ii) the
exploration phase, in which CA i selects a matching action in
$\mathcal{K}_i$ and updates its relevance score estimate based on the
feedback of its user; and (iii) the exploitation phase, in
which CA i chooses the matching action with the highest
estimated relevance score minus cost.

Since the CAs are cooperative, when another CA requests
content from CA i, CA i chooses the content from its content
network with the highest estimated relevance score for the user
of the requesting CA. To maximize the number of likes minus
costs in exploitations, CA i must have an accurate estimate
of the relevance scores of other CAs. This task is not trivial,
since CA i does not know the content networks of other CAs. To
achieve this, CA i should carefully select which of its users'
feedback to use when estimating the relevance score of CA j.
The feedback should come from previous times at which CA i
has very high confidence that the content of CA j matched
with its user is the one with the highest relevance score for the
context of CA i's user. Thus, the training phase of CA i helps
other CAs build accurate estimates of the relevance scores
of their content, before CA i uses any feedback for content
coming from these CAs to form relevance score estimates
about them. In contrast, the exploration phase of CA i helps
it build accurate estimates of the relevance scores of its
matching actions.

DISCOM for CA i:

  Input: $H_1(t)$, $H_2(t)$, $H_3(t)$, $T$, $m_T$
  Initialize: Partition $\mathcal{X}$ into hypercubes denoted by $\mathcal{P}_T$
  Initialize: Set counters $N^i_p = 0$, $\forall p \in \mathcal{P}_T$;
      $N^i_{k,p} = 0$, $\forall k \in \mathcal{K}_i, p \in \mathcal{P}_T$;
      $N^{\mathrm{tr},i}_{j,p} = 0$, $\forall j \in \mathcal{M}_{-i}, p \in \mathcal{P}_T$
  Initialize: Set relevance score estimates $\bar{r}^i_{k,p} = 0$, $\forall k \in \mathcal{K}_i, p \in \mathcal{P}_T$
  while $t \ge 1$ do
    Run DISCOMmax to find $p = p_i(t)$, to obtain a matching action $a_i$, and the value of the train flag
    If $a_i \in \mathcal{M}_{-i}$, ask CA $a_i$ for content and pass $x_i(t)$
    Receive $\mathcal{CA}_i(t)$, the set of CAs who requested content from CA i, and their contexts
    if $\mathcal{CA}_i(t) \ne \emptyset$ then
      Run DISCOMcoop to obtain the content to be selected $b_i := \{b_{i,j}\}_{j \in \mathcal{CA}_i(t)}$
          and the hypercubes that the contexts of the users in $\mathcal{CA}_i(t)$ lie in, $p_i := \{p_{i,j}\}_{j \in \mathcal{CA}_i(t)}$
    end if
    if $a_i \in \mathcal{C}_i$ then
      Pay cost $d^i_{a_i}$, obtain content $a_i$
      Show $a_i$ to the user, receive feedback $r \in \{0,1\}$ drawn from $\mathrm{Ber}_{a_i}(x_i(t))$ (a)
    else
      Pay cost $d^i_{a_i}$, obtain content $b_{a_i,i}$ from CA $a_i$
      Show $b_{a_i,i}$ to the user, receive feedback $r \in \{0,1\}$ drawn from $\mathrm{Ber}_{b_{a_i,i}}(x_i(t))$
    end if
    if train = 1 then
      $N^{\mathrm{tr},i}_{a_i,p}$++
    else
      $\bar{r}^i_{a_i,p} = (\bar{r}^i_{a_i,p} N^i_{a_i,p} + r)/(N^i_{a_i,p} + 1)$
      $N^i_p$++, $N^i_{a_i,p}$++
    end if
    if $\mathcal{CA}_i(t) \ne \emptyset$ then
      for $j \in \mathcal{CA}_i(t)$ do
        Send content $b_{i,j}$ to CA j's user
        Observe feedback r drawn from $\mathrm{Ber}_{b_{i,j}}(x_j(t))$
        $\bar{r}^i_{b_{i,j},p_{i,j}} = (\bar{r}^i_{b_{i,j},p_{i,j}} N^i_{b_{i,j},p_{i,j}} + r)/(N^i_{b_{i,j},p_{i,j}} + 1)$
        $N^i_{p_{i,j}}$++, $N^i_{b_{i,j},p_{i,j}}$++
      end for
    end if
    $t = t + 1$
  end while

  (a) $\mathrm{Ber}_{a_i}(x_i(t))$ is the Bernoulli distribution with expected value $\pi_{a_i}(x_i(t))$.

Fig. 4. Pseudocode for DISCOM algorithm.

At time t, the phase that CA i will be in is determined by
the amount of time it has explored, exploited or trained for
past users with contexts similar to the context of the current
user. For this purpose, CA i keeps counters and control
functions, which are described below. Let $N^i_p(t)$ be the number
of user arrivals to CA i with contexts in $p \in \mathcal{P}_T$ by time t
(its own arrivals and arrivals to other CAs who requested
content from CA i), excluding the training phases of CA i.
For $c \in \mathcal{C}_i$, let $N^i_{c,p}(t)$ be the number of times content c is
selected in response to a user arriving to CA i with context
in hypercube p by time t (including times other CAs request
content from CA i for their users with contexts in set p).
In addition, CA i keeps two counters for each other CA in each
set in the partition, which it uses to decide the phase it
should be in. The first one, $N^{\mathrm{tr},i}_{j,p}(t)$, is an estimate of the
number of user arrivals with contexts in p to CA j from all
CAs, excluding the training phases of CA j and the exploration
and exploitation phases of CA i. This counter is only updated
when CA i thinks that CA j should be trained. The second one,
$N^i_{j,p}(t)$, counts the number of users of CA i with contexts in p
for which content is requested from CA j in the exploration
and exploitation phases of CA i by time t.

DISCOMmax (maximization part of DISCOM) for CA i:

  train = 0
  Find the hypercube in $\mathcal{P}_T$ that $x_i(t)$ belongs to, i.e., $p_i(t)$
  Let $p = p_i(t)$
  Compute the set of under-explored matching actions $\mathcal{C}^{\mathrm{ue}}_{i,p}(t)$ given in (4)
  if $\mathcal{C}^{\mathrm{ue}}_{i,p}(t) \ne \emptyset$ then
    Select $a_i$ randomly from $\mathcal{C}^{\mathrm{ue}}_{i,p}(t)$
  else
    Compute the set of training candidates $\mathcal{M}^{\mathrm{ct}}_{i,p}(t)$ given in (5)
    // Update the counters of training candidates
    for $j \in \mathcal{M}^{\mathrm{ct}}_{i,p}(t)$ do
      Obtain $N^j_p$ from CA j, set $N^{\mathrm{tr},i}_{j,p} = N^j_p - N^i_{j,p}$
    end for
    Compute the set of under-trained CAs $\mathcal{M}^{\mathrm{ut}}_{i,p}(t)$ given in (6)
    Compute the set of under-explored CAs $\mathcal{M}^{\mathrm{ue}}_{i,p}(t)$ given in (7)
    if $\mathcal{M}^{\mathrm{ut}}_{i,p}(t) \ne \emptyset$ then
      Select $a_i$ randomly from $\mathcal{M}^{\mathrm{ut}}_{i,p}(t)$, train = 1
    else if $\mathcal{M}^{\mathrm{ue}}_{i,p}(t) \ne \emptyset$ then
      Select $a_i$ randomly from $\mathcal{M}^{\mathrm{ue}}_{i,p}(t)$
    else
      Select $a_i$ randomly from $\arg\max_{k \in \mathcal{K}_i} \bar{r}^i_{k,p} - d^i_k$
    end if
  end if

Fig. 5. Pseudocode for the maximization part of DISCOM algorithm.

DISCOMcoop (cooperation part of DISCOM) for CA i:

  for $j \in \mathcal{CA}_i(t)$ do
    Find the set in $\mathcal{P}_T$ that $x_j(t)$ belongs to, i.e., $p_{i,j}$
    Compute the set of under-explored matching actions $\mathcal{C}^{\mathrm{ue}}_{i,p_{i,j}}(t)$ given in (4)
    if $\mathcal{C}^{\mathrm{ue}}_{i,p_{i,j}}(t) \ne \emptyset$ then
      Select $b_{i,j}$ randomly from $\mathcal{C}^{\mathrm{ue}}_{i,p_{i,j}}(t)$
    else
      $b_{i,j} = \arg\max_{c \in \mathcal{C}_i} \bar{r}^i_{c,p_{i,j}}$
    end if
  end for

Fig. 6. Pseudocode for the cooperation part of DISCOM algorithm.

At each time slot t, CA i first identifies $p_i(t)$. Then, it
chooses its phase at time t by giving highest priority to
exploration of content in its own content network, second
highest priority to training of the other CAs, third highest
priority to exploration of the other CAs, and lowest priority
to exploitation. Exploration of own content has a higher
priority than training of other CAs because it minimizes the
number of times CA i will be trained by other CAs, as we
describe below.
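The priority order among the phases can be sketched as below. This is a minimal illustration, assuming the under-explored and under-trained sets have already been computed and that matching actions are labeled by strings; the function name `choose_phase` and its arguments are hypothetical.

```python
import random


def choose_phase(under_explored_content, under_trained_cas,
                 under_explored_cas, net_reward_estimates, rng=random):
    """Return (phase, action) following the priority order in the text.

    Priority: explore own content > train other CAs > explore other CAs >
    exploit. net_reward_estimates maps each matching action to its estimated
    relevance score minus cost; ties in exploitation are broken at random.
    """
    if under_explored_content:
        return "explore_content", rng.choice(sorted(under_explored_content))
    if under_trained_cas:
        return "train", rng.choice(sorted(under_trained_cas))
    if under_explored_cas:
        return "explore_ca", rng.choice(sorted(under_explored_cas))
    best = max(net_reward_estimates.values())
    maximizers = sorted(k for k, v in net_reward_estimates.items() if v == best)
    return "exploit", rng.choice(maximizers)
```

Exploitation is reached only when all three sets are empty, which is why the sizes of these sets (governed by the control functions) determine how often a CA can exploit.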

First, CA i identifies the set of under-explored content in
its content network:

$$\mathcal{C}^{\mathrm{ue}}_{i,p}(t) := \{ c \in \mathcal{C}_i : N^i_{c,p}(t) \le H_1(t) \} \tag{4}$$

where $H_1(t)$ is a deterministic, increasing function of t which
is called the control function. The value of this function will
affect the regret of DISCOM. For $c \in \mathcal{C}_i$, the accuracy of the
relevance score estimates increases with $H_1(t)$, hence it should
be selected to balance the tradeoff between accuracy and
the number of explorations. If $\mathcal{C}^{\mathrm{ue}}_{i,p}(t)$ is non-empty, CA i
enters the exploration phase and randomly selects a content in
this set to explore. Otherwise, it identifies the set of training
candidates:

$$\mathcal{M}^{\mathrm{ct}}_{i,p}(t) := \{ j \in \mathcal{M}_{-i} : N^{\mathrm{tr},i}_{j,p}(t) \le H_2(t) \} \tag{5}$$

where $H_2(t)$ is a control function similar to $H_1(t)$. The accuracy
of other CAs' relevance score estimates of content in their own
networks increases with $H_2(t)$, hence it should be selected to
balance the possible reward gain of CA i due to this increase
against the reward loss of CA i due to the number of trainings. If
this set is non-empty, CA i asks the CAs $j \in \mathcal{M}^{\mathrm{ct}}_{i,p}(t)$ to report
$N^j_p(t)$. Based on the reported values it recomputes $N^{\mathrm{tr},i}_{j,p}(t)$ as
$N^{\mathrm{tr},i}_{j,p}(t) = N^j_p(t) - N^i_{j,p}(t)$. Using the updated values, CA i
identifies the set of under-trained CAs:

$$\mathcal{M}^{\mathrm{ut}}_{i,p}(t) := \{ j \in \mathcal{M}_{-i} : N^{\mathrm{tr},i}_{j,p}(t) \le H_2(t) \}. \tag{6}$$

If this set is non-empty, CA i enters the training phase and
randomly selects a CA in this set to train it. When $\mathcal{M}^{\mathrm{ct}}_{i,p}(t)$ or
$\mathcal{M}^{\mathrm{ut}}_{i,p}(t)$ is empty, there is no under-trained
CA, hence CA i checks whether there is an under-explored matching
action. The set of CAs for which CA i does not have accurate
relevance scores is given by

$$\mathcal{M}^{\mathrm{ue}}_{i,p}(t) := \{ j \in \mathcal{M}_{-i} : N^i_{j,p}(t) \le H_3(t) \} \tag{7}$$

where $H_3(t)$ is also a control function similar to $H_1(t)$. If
this set is non-empty, CA i enters the exploration phase and
randomly selects a CA in this set to request content from, in
order to explore it. Otherwise, CA i enters the exploitation
phase, in which it selects the matching action with the highest
estimated relevance score minus cost for its user with context
$x_i(t) \in p = p_i(t)$, i.e.,

$$a_i(t) \in \arg\max_{k \in \mathcal{K}_i} \bar{r}^i_{k,p}(t) - d^i_k \tag{8}$$

where $\bar{r}^i_{k,p}(t)$ is the sample mean estimate of the relevance
score of CA i for matching action k at time t, which is
computed as follows. For $j \in \mathcal{M}_{-i}$, let $\mathcal{E}^i_{j,p}(t)$ be the set
of feedback collected by CA i at times it selected CA j
while CA i's users' contexts are in set p in its exploration
and exploitation phases by time t. For estimating the relevance
score of content in its own content network, CA i can also
use the feedback obtained from other CAs' users at times they
requested content from CA i. To take this into account,
for $c \in \mathcal{C}_i$, let $\mathcal{E}^i_{c,p}(t)$ be the set of feedback observed by CA i
at times it selected its content c for its own users with contexts
in set p, together with the feedback observed by CA i when it
selected its content c for the users of other CAs with contexts
in set p who requested content from CA i by time t.

TABLE III
Notations used in definition of DISCOM.

$L$: Similarity constant. $\gamma$: Similarity exponent
$T$: Time horizon
$m_T$: Slicing level of DISCOM
$\mathcal{P}_T$: DISCOM's partition of $\mathcal{X}$ into $(m_T)^d$ hypercubes
$p_i(t)$: Hypercube in $\mathcal{P}_T$ that contains $x_i(t)$
$N^i_p(t)$: Number of all user arrivals to CA i with contexts in $p \in \mathcal{P}_T$ by time t except the training phases of CA i
$N^i_{c,p}(t)$: Number of times content c is selected in response to a user arriving to CA i with context in hypercube p by time t
$N^{\mathrm{tr},i}_{j,p}(t)$: Estimate of CA i on the number of user arrivals with contexts in p to CA j from all CAs except the training phases of CA j and exploration, exploitation phases of CA i
$N^i_{j,p}(t)$: Number of users of CA i with contexts in p for which content is requested from CA j at exploration and exploitation phases of CA i by time t
$H_1(t)$, $H_2(t)$, $H_3(t)$: Control functions of DISCOM
$\mathcal{C}^{\mathrm{ue}}_{i,p}(t)$: Set of under-explored content in $\mathcal{C}_i$
$\mathcal{M}^{\mathrm{ct}}_{i,p}(t)$: Set of training candidates of CA i
$\mathcal{M}^{\mathrm{ut}}_{i,p}(t)$: Set of CAs under-trained by CA i
$\mathcal{M}^{\mathrm{ue}}_{i,p}(t)$: Set of CAs under-explored by CA i
$\bar{r}^i_{k,p}(t)$: Sample mean relevance score of action k of CA i at time t
$\hat{\mu}^i_{k,p}(t)$: Estimated net reward of action k of CA i at time t

Therefore, the sample mean relevance score of matching action
$k \in \mathcal{K}_i$ for users with contexts in set p for CA i is defined as
$\bar{r}^i_{k,p}(t) := \big(\sum_{r \in \mathcal{E}^i_{k,p}(t)} r\big) / |\mathcal{E}^i_{k,p}(t)|$. An important observation
is that the computation of $\bar{r}^i_{k,p}(t)$ does not take into account the
matching costs. Let $\hat{\mu}^i_{k,p}(t) := \bar{r}^i_{k,p}(t) - d^i_k$ be the estimated net
reward (relevance score minus cost) of matching action k for
set p. Of note, when there is more than one maximizer of (8),
one of them is randomly selected. In order to run DISCOM,
CA i does not need to keep the sets $\mathcal{E}^i_{k,p}(t)$ in its memory:
$\bar{r}^i_{k,p}(t)$ can be computed by using only $\bar{r}^i_{k,p}(t-1)$, the
corresponding counter, and the feedback at time t.
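The running update of the sample mean mentioned above can be sketched as follows. The helper name `update_sample_mean` is hypothetical; the arithmetic follows the update rule that appears in the DISCOM pseudocode, where the new mean is the old mean times the old count plus the new feedback, divided by the new count.

```python
def update_sample_mean(mean, count, feedback):
    """Fold one new feedback value into a running sample mean.

    mean:     sample mean relevance score before the update
    count:    number of feedback values already included in `mean`
    feedback: new feedback, 1 for like and 0 for dislike
    Returns the updated (mean, count) pair.
    """
    new_count = count + 1
    new_mean = (mean * count + feedback) / new_count
    return new_mean, new_count


# Usage: process a feedback stream without storing past feedback.
mean, count = 0.0, 0
for r in [1, 0, 1, 1]:
    mean, count = update_sample_mean(mean, count, r)
```

Only the pair (mean, count) has to be stored per matching action and hypercube, so the memory requirement does not grow with the number of users.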

The cooperation part of DISCOM, i.e., DISCOMcoop, operates
as follows. Let $\mathcal{CA}_i(t)$ be the set of CAs who request content
from CA i at time t. For each $j \in \mathcal{CA}_i(t)$, CA i first checks if
it has any under-explored content c for $p_j(t)$, i.e., c such that
$N^i_{c,p_j(t)}(t) \le H_1(t)$. If so, it randomly selects one of its
under-explored content to match it with the user of CA j. Otherwise,
it exploits its content in $\mathcal{C}_i$ with the highest estimated relevance
score for CA j's current user's context, i.e.,

$$b_{i,j}(t) \in \arg\max_{c \in \mathcal{C}_i} \bar{r}^i_{c,p_j(t)}(t). \tag{9}$$


A summary of notations used in the description of DISCOM
is given in Table III. The following theorem provides a bound
on the regret of DISCOM.

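Theorem 1: When DISCOM is run by all CAs with parameters
$H_1(t) = t^{2\gamma/(3\gamma+d)} \log t$, $H_2(t) = C_{\max} t^{2\gamma/(3\gamma+d)} \log t$,
$H_3(t) = t^{2\gamma/(3\gamma+d)} \log t$ and $m_T = \lceil T^{1/(3\gamma+d)} \rceil$, we have

$$R_i(T) \le 4(M + C_{\max} + 1)\beta_2
+ T^{\frac{2\gamma+d}{3\gamma+d}} \left( \frac{14 L d^{\gamma/2} + 12 + 4(|\mathcal{C}_i| + M) M C_{\max} \beta_2}{(2\gamma+d)/(3\gamma+d)} + 2^{d+1} Z_i \log T \right)
+ T^{\frac{d}{3\gamma+d}} 2^{d+1} (|\mathcal{C}_i| + 2(M-1)),$$

i.e., $R_i(T) = \tilde{O}\big(M C_{\max} T^{\frac{2\gamma+d}{3\gamma+d}}\big)$, where $Z_i = |\mathcal{C}_i| + (M-1)(C_{\max}+1)$.

Here, for a number $r \in \mathbb{R}$, $\lceil r \rceil$ denotes the smallest integer that is
greater than or equal to r, and $\tilde{O}(\cdot)$ is the Big-O notation in which
the terms with logarithmic growth rates are hidden.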

Proof: The proof is given in the Appendix. ■

For any d > 0 and $\gamma$ > 0, the regret given in Theorem 1 is
sublinear in time (or number of user arrivals). This guarantees
that the regret per user, i.e., the time-averaged regret, converges
to 0 ($\lim_{T \to \infty} E[R_i(T)]/T = 0$). The regret also
increases with the dimension d of the context. By
Assumption 1, a context is similar to another context if they
are similar in each dimension, hence the number of hypercubes in
the partition $\mathcal{P}_T$ increases with d.
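To make the dependence on d concrete, the time exponent of the regret and the slicing level from Theorem 1 can be evaluated numerically. The sketch below is illustrative only; the function names are hypothetical, and the values T = 70000 and $\gamma$ = 1 in the usage lines are example inputs chosen here.

```python
import math


def regret_exponent(gamma, d):
    """Exponent of T in the regret order of Theorem 1: (2*gamma + d) / (3*gamma + d)."""
    return (2 * gamma + d) / (3 * gamma + d)


def slicing_level(T, gamma, d):
    """Slicing level m_T = ceil(T ** (1 / (3*gamma + d))) from Theorem 1."""
    return math.ceil(T ** (1.0 / (3 * gamma + d)))


# Usage with example inputs: the exponent approaches 1 as d grows,
# and the number of hypercubes is slicing_level(...) ** d.
for d in (1, 2):
    m_T = slicing_level(70000, 1, d)
    print(d, regret_exponent(1, d), m_T, m_T ** d)
```

The exponent stays strictly below 1 for every finite d, which is the sublinearity claim, but it moves closer to 1 as the context dimension increases.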

In our analysis of the regret of DISCOM we assumed that
T is fixed and given as an input to the algorithm. DISCOM
can be made to run independently of the final time T by
using a standard method called the doubling trick (see, e.g.,
[20]). The idea is to divide time into rounds with geometrically
increasing lengths and run a new instance of DISCOM at each
round. For instance, consider rounds $\tau \in \{1, 2, \ldots\}$, where
each round has length $2^{\tau}$. Run a new instance of DISCOM
at the beginning of each round with time parameter $2^{\tau}$. This
modified version will also have $\tilde{O}\big(T^{\frac{2\gamma+d}{3\gamma+d}}\big)$ regret.
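The round schedule of the doubling trick can be sketched as follows. This is a minimal illustration of the schedule only: the function name `doubling_rounds` is hypothetical, and restarting DISCOM itself is left out.

```python
def doubling_rounds(total_time):
    """List the rounds of the doubling trick that cover `total_time` slots.

    Each entry is (tau, time_parameter, slots_used): round tau is run with
    time parameter 2**tau, and the last round may be cut short when the
    (unknown) final time is reached.
    """
    rounds = []
    elapsed = 0
    tau = 1
    while elapsed < total_time:
        length = 2 ** tau
        used = min(length, total_time - elapsed)
        rounds.append((tau, length, used))
        elapsed += used
        tau += 1
    return rounds


# Usage: a fresh learner instance would be created at the start of each round,
# configured with the round's time parameter instead of the true horizon.
schedule = doubling_rounds(10)
```

Because the round lengths grow geometrically, only a logarithmic number of restarts occur by any time T, which is why the order of the regret is preserved.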

Maximizing the satisfaction of an individual user is as
important as maximizing the overall satisfaction of all users. The
next corollary shows that by using DISCOM, CAs guarantee
that their users will almost always be provided with the best
content available within the entire content network.

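Corollary 1: Assume that DISCOM is run with the set
of parameters given in Theorem 1. When DISCOM is in the
exploitation phase for CA i, we have

$$P\Big( \pi_{a_i(t)}(x_i(t)) - d^i_{a_i(t)} < \pi_{k_i^*(x_i(t))}(x_i(t)) - d^i_{k_i^*(x_i(t))} - \delta_T \Big)
\le \frac{2|\mathcal{K}_i|}{t^2} + \frac{2|\mathcal{K}_i| M C_{\max} \beta_2}{t^{\gamma/(3\gamma+d)}}$$

where $\delta_T = (6 L d^{\gamma/2} + 6) T^{-\gamma/(3\gamma+d)}$.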

Proof: The proof is given in the Appendix. ■

Remark 1: (Differential Services) Maximizing the performance
for an individual user is particularly important for
providing differential services based on the types of the users.
For instance, a CA may want to provide higher quality
recommendations to a subscriber (high type user) who has
paid for the subscription compared to a non-subscriber (low
type user). To do this, the CA can exploit the best content for
the subscribed user, while performing exploration on a different
user that is not subscribed.

V. Regret When Feedback is Missing 

When analyzing the performance of DISCOM, we assumed
that the users always provide feedback: like or dislike.
However, in most online content aggregation platforms
user feedback is not always available. In this section we
consider the effect of missing feedback on the performance
of the proposed algorithm. We assume that each user gives
feedback with probability $p_r$ (which is unknown to the CAs).
If the user at time t does not give feedback, we assume that
DISCOM does not update its counters. This will result in
a larger number of trainings and explorations compared to
the case when feedback is always available. The following
theorem gives an upper bound on the regret of DISCOM for
this case.


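Theorem 2: Let the DISCOM algorithm run with parameters
$H_1(t) = t^{2\gamma/(3\gamma+d)} \log t$, $H_2(t) = C_{\max} t^{2\gamma/(3\gamma+d)} \log t$,
$H_3(t) = t^{2\gamma/(3\gamma+d)} \log t$, and $m_T = \lceil T^{1/(3\gamma+d)} \rceil$. Then, if a
user reveals its feedback with probability $p_r$, we have for CA i,

$$R_i(T) \le 4(M + C_{\max} + 1)\beta_2
+ T^{\frac{2\gamma+d}{3\gamma+d}} \left( \frac{14 L d^{\gamma/2} + 12 + 4(|\mathcal{C}_i| + M) M C_{\max} \beta_2}{(2\gamma+d)/(3\gamma+d)} + \frac{2^{d+1} Z_i \log T}{p_r} \right)
+ T^{\frac{d}{3\gamma+d}} 2^{d+1} \, \frac{|\mathcal{C}_i| + 2(M-1)}{p_r},$$

i.e., $R_i(T) = \tilde{O}\big(M C_{\max} T^{\frac{2\gamma+d}{3\gamma+d}} / p_r\big)$, where $Z_i = |\mathcal{C}_i| + (M-1)(C_{\max}+1)$ and $\beta_a := \sum_{t=1}^{\infty} 1/t^a$.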

Proof: The proof is given in the Appendix. ■

From Theorem 2 we see that missing feedback does not
change the time order of the regret. However, the regret is
scaled by $1/p_r$, which is the expected number of users
required for a single feedback.
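The interpretation of $1/p_r$ as an expected number of users can be verified with a short calculation. The derivation below is a sketch that assumes users reveal feedback independently of each other, each with the same probability $p_r$; the symbol $N$ for the number of users until the first feedback is introduced here for illustration.

```latex
% N: number of user arrivals until the first feedback is observed.
% Under the independence assumption, N is geometrically distributed:
\[
  P(N = n) = p_r (1 - p_r)^{n-1}, \qquad n = 1, 2, \ldots
\]
% Its expectation is therefore
\[
  E[N] = \sum_{n=1}^{\infty} n \, p_r (1 - p_r)^{n-1}
       = p_r \cdot \frac{1}{\bigl(1 - (1 - p_r)\bigr)^{2}}
       = \frac{1}{p_r}.
\]
```

For example, with $p_r = 0.25$ a CA needs on average four user arrivals to advance a counter by one, which matches the $1/p_r$ factor on the exploration and training terms.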


VI. Learning Under Dynamic User and Content 
Characteristics 


When the user and content characteristics change over time,
the relevance score of content c for a user with context x
changes over time. In this section, we assume that the following
relation holds between the probabilities that a content will
be liked by users with similar contexts at two different times
t and t'.

Assumption 2: For each $c \in \mathcal{C}$, there exists L > 0, $\gamma$ > 0
such that for all $x, x' \in \mathcal{X}$, we have

$$|\pi_{c,t}(x) - \pi_{c,t'}(x')| \le L(\|x - x'\|)^{\gamma} + |t - t'|/T_s$$

where $1/T_s$ > 0 is the speed of the change in user and content
characteristics. We call $T_s$ the stability parameter.

Assumption 2 captures the temporal dynamics of content
matching, which is absent in Assumption 1. Such temporal
variations are often referred to as concept drift 0, ig. 
When there is concept drift, a learner should also consider 
which past information to take into account when learning, in 
addition to how to combine the past information to learn the 
best matching strategy. 

The following modification of DISCOM deals with
dynamically changing user and content characteristics by using
a time window of past observations in estimating the relevance
scores. The modified algorithm is called DISCOM with time
window (DISCOM-W). This algorithm groups the time slots
into rounds $\eta = 1, 2, \ldots$, each having a fixed length of $2\tau_h$ time
slots, where $\tau_h$ is an integer called the half window length.
Some of the time slots in these rounds overlap with each
other, as shown in Fig. 7. The idea is to keep separate control
functions and counters for each round, and calculate the
sample mean relevance scores for groups of similar contexts
based only on the observations that are made during the time
window of that round. We call $\eta = 1$ the initialization round.
The control functions for the initialization round of DISCOM-W
are the same as the control functions $H_1(t)$, $H_2(t)$ and $H_3(t)$
of DISCOM, whose values are given in Theorem 1. For the
other rounds $\eta > 1$, the control functions depend on $\tau_h$ and
are given as

$$H_1^{\tau_h}(t) = H_3^{\tau_h}(t) = (t \bmod \tau_h + 1)^z \log(t \bmod \tau_h + 1)$$

and

$$H_2^{\tau_h}(t) = C_{\max} (t \bmod \tau_h + 1)^z \log(t \bmod \tau_h + 1)$$

for some 0 < z < 1. Each round $\eta$ is divided into two
sub-rounds. Except for the initialization round, i.e., $\eta = 1$, the first
sub-round is called the passive sub-round, while the second
sub-round is called the active sub-round. For the initialization
round both sub-rounds are active sub-rounds. In order to
reduce the number of trainings and explorations, DISCOM-W
has an overlapping round structure as shown in Fig. 7.
For each round except the initialization round, the passive
sub-round of round $\eta$ overlaps with the active sub-round of round
$\eta - 1$. DISCOM-W operates in the same way as DISCOM in
each round. DISCOM-W can be viewed as an algorithm which
generates a new instance of DISCOM at the beginning of each
round, with the modified control functions. DISCOM-W runs
two different instances of DISCOM at each round. One of
these instances is the active instance, based on which content
matchings are performed, and the other one is the passive
instance, which learns through the content matchings made by
the active instance.

Let the instance of DISCOM that is run by DISCOM-W
at round $\eta$ be DISCOM$_\eta$. The hypercubes of DISCOM$_\eta$ are
formed in a way similar to DISCOM's. The input time horizon
is taken as $T_s$, which is the stability parameter given in
Assumption 2, and the slicing parameter $m_{T_s}$ is set accordingly.
DISCOM$_\eta$ uses the partition of $\mathcal{X}$ into $(m_{T_s})^d$ hypercubes
denoted by $\mathcal{P}_{T_s}$. When all CAs are using DISCOM-W, the
matching action selection of CA i only depends on the history
of content matchings and feedback observations at round $\eta$. If
time t is in the active sub-round of round $\eta$, the matching action of
CA $i \in \mathcal{M}$ is taken according to DISCOM$_\eta$. As a result of the
content matching, the sample mean relevance scores and counters
of both DISCOM$_\eta$ and DISCOM$_{\eta+1}$ are updated. Else, if time
t is in the passive sub-round of round $\eta$, the matching action of
CA $i \in \mathcal{M}$ is taken according to DISCOM$_{\eta-1}$ (see Fig. 7).
As a result of this, the sample mean relevance scores and counters
of both DISCOM$_{\eta-1}$ and DISCOM$_\eta$ are updated.
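The overlapping round structure can be sketched as a mapping from a time slot to the instances involved. This sketch rests on an indexing assumption made here and not stated in the text: time slots start at t = 1, round 1 covers slots 1 to $2\tau_h$, and round $\eta \ge 2$ covers slots $(\eta-1)\tau_h + 1$ to $(\eta+1)\tau_h$. The function names are hypothetical.

```python
def active_round(t, tau_h):
    """Index of the round whose instance selects the matching action at slot t.

    Assumes 1-indexed slots, round 1 covering slots 1..2*tau_h, and round
    eta >= 2 covering slots (eta-1)*tau_h + 1 .. (eta+1)*tau_h.
    """
    if t < 1 or tau_h < 1:
        raise ValueError("t and tau_h must be positive integers")
    half_window = (t + tau_h - 1) // tau_h  # integer ceiling of t / tau_h
    return max(1, half_window - 1)


def instances_to_update(t, tau_h):
    """Rounds whose counters and estimates are updated at slot t.

    The active instance is always updated; the next round's instance is
    updated as well once its passive sub-round has started (t > tau_h).
    """
    eta = active_round(t, tau_h)
    if t <= tau_h:
        return (eta,)
    return (eta, eta + 1)
```

With $\tau_h = 5$, slots 1 to 5 involve only round 1, slots 6 to 10 are acted on by round 1 while round 2 learns passively, and from slot 11 round 2 becomes active while round 3 starts learning.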

At the start of a round $\eta$, the relevance score estimates and
counters for DISCOM$_\eta$ are equal to zero. However, due to the
two sub-round structure, when the active sub-round of round
$\eta$ starts, CA i already has some observations for the contexts
and actions taken in the passive sub-round of that round. Hence,
depending on the arrivals and actions in the passive sub-round,
the CA may even start the active sub-round by exploiting,
whereas it would always have to spend some time in training
and exploration if it started an active sub-round without any
past observations (cold start problem).

Fig. 7. Operation of DISCOM-W showing the round structure and the
different instances of DISCOM running for each round.

In this section, due to the concept drift, even though the
context of a past user can be similar to the context of the
current user, their relevance scores for a content c can be
very different. Hence DISCOM-W assumes that a past user
is similar to the current user only if it arrived in the current
round. Since the round length is fixed, it is impossible to have
a sublinear number of similar context observations for every
t. Thus, achieving sublinear regret under concept drift is not
possible. Therefore, in this section we focus on the average
regret which is given by

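$$R_i^{\mathrm{avg}}(T) := \frac{1}{T} \sum_{t=1}^{T} \Big( \pi_{k_i^*(x_i(t)),t}(x_i(t)) - d^i_{k_i^*(x_i(t))} \Big) - \frac{1}{T} E\left[ \sum_{t=1}^{T} \Big( \mathrm{I}(y_i(t) = L) - d^i_{a_i(t)} \Big) \right].$$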

The following theorem bounds the average regret of 
DISCOM-W. 

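Theorem 3: When DISCOM-W is run with parameters

$$H_1^{\tau_h}(t) = H_3^{\tau_h}(t) = (t \bmod \tau_h + 1)^{\frac{2\gamma}{3\gamma+d}} \log(t \bmod \tau_h + 1),$$
$$H_2^{\tau_h}(t) = C_{\max} (t \bmod \tau_h + 1)^{\frac{2\gamma}{3\gamma+d}} \log(t \bmod \tau_h + 1),$$

$m_{T_s} = \lceil T_s^{1/(3\gamma+d)} \rceil$ and $\tau_h = \lfloor T_s^{(3\gamma+d)/(4\gamma+d)} \rfloor$, where $T_s$ is the
stability parameter which is given in Assumption 2, the time
averaged regret of CA i by time T is

$$R_i^{\mathrm{avg}}(T) = \tilde{O}\big( T_s^{-\gamma/(4\gamma+d)} \big)$$

for any T > 0. Hence DISCOM-W is $\epsilon = \tilde{O}\big( T_s^{-\gamma/(4\gamma+d)} \big)$
approximately optimal in terms of the average reward.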

Proof: The proof is given in the Appendix. ■

From the result of this theorem we see that the average
regret decays as the stability parameter $T_s$ increases. This is
because DISCOM-W uses a longer time window (round)
when $T_s$ is large, and thus obtains more observations to
estimate the sample mean relevance scores of the matching actions
in that round, which results in better estimates and hence a
smaller number of suboptimal matching action selections.
Moreover, the average number of trainings and explorations
required decreases with the round length.
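The role of the window length can be illustrated with a back-of-the-envelope balancing argument. The sketch below is a heuristic written for illustration, not the proof: it ignores constants and logarithmic factors, and it assumes that the per-window learning loss follows the order of Theorem 1 with T replaced by $\tau_h$.

```latex
% Average learning loss within a window of length tau_h
% (regret of order tau_h^{(2 gamma + d)/(3 gamma + d)} divided by tau_h):
\[
  \text{learning loss} \;\approx\; \tau_h^{-\gamma/(3\gamma+d)}
\]
% Drift loss from Assumption 2, for observations at most about tau_h apart:
\[
  \text{drift loss} \;\approx\; \tau_h / T_s
\]
% Balancing the two terms:
\[
  \tau_h^{-\gamma/(3\gamma+d)} = \frac{\tau_h}{T_s}
  \;\Longrightarrow\;
  \tau_h = T_s^{(3\gamma+d)/(4\gamma+d)},
  \qquad
  \text{average loss} \;\approx\; T_s^{-\gamma/(4\gamma+d)}.
\]
```

A larger $T_s$ (slower drift) therefore permits a longer window and a smaller average regret, in line with the discussion above.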

*For a number b, $\lfloor b \rfloor$ denotes the largest integer that is smaller than or
equal to b.
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VII. Numerical Results 

In this section we provide numerical results for our proposed 
algorithms DISCOM and DISCOM-W on real-world datasets. 

A. Datasets 

For all the datasets below, for a CA the cost of choosing a 
content within the content network and the cost of choosing 
another CA is set to 0. Hence, the only factor that affects the 
total reward is the users’ ratings for the contents. 

Yahoo! Today Module (YTM) Q: The dataset contains 
news article webpage recommendations of Yahoo! Front Page. 
Each instance is composed of (i) ID of the recommended 
content, (ii) the user’s context (2-dimensional vector), (iii) the 
user’s click information. The user’s click information for a 
webpage/content is associated with the relevance score of that 
content. It is equal to 1 if the user clicked on the recommended
webpage and 0 otherwise. The dataset contains T = 70000 instances
and 40 different types of content. We generate 4 CAs and
assign 10 of the 40 types of content to each CA's content
network. Each CA has direct access to content in its own
network, while it can also access the content in other CAs'
content networks by requesting content from these CAs. Users
are divided into four groups according to their contexts and
each group is randomly assigned to one of the CAs. Hence,
the user arrival processes to different CAs are different. The
performance of a CA is evaluated in terms of the average 
number of clicks, i.e., click through rate (CTR), of the contents 
that are matched with its users. 

Music Dataset (MD): The dataset contains contextual
information and ratings (like/dislike) of music genres (classical,
rock, pop, rap) collected from 413 students at UCLA. We 
generate 2 CAs each specialized in two of the four music 
genres. Users among the 413 users randomly arrive to each 
CA. A CA either recommends a music content that is in its 
content network or asks another CA, specialized in another 
music genre, to provide a music item. As a result, the rating 
of the user for the genre of the provided music content is 
revealed to the CA. The performance of a CA is evaluated in 
terms of the average number of likes it gets for the contents 
that are matched with its users. 

Yahoo! Today Module (YTM) with Drift (YTMD): This
dataset is generated from YTM to simulate the scenario where
the user ratings for a particular content change over time.
After every 10000 instances, 20 contents are randomly selected
and user clicks for these contents are set to 0 (no click) for
the next 10000 instances. For instance, this can represent a
scenario where some news articles lose their popularity a day
after they become available, while some other news articles
related to ongoing events stay popular for several days.
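The drift-generation procedure described above can be sketched as follows. This is an illustrative reading of the description, not the authors' script: the function name `apply_drift`, the tuple layout `(content_id, context, click)` and the seeding are assumptions made here.

```python
import random


def apply_drift(instances, content_ids, block_size=10000, n_drift=20, seed=0):
    """Create a drifting copy of a click log.

    instances:   list of (content_id, context, click) tuples in arrival order
    content_ids: collection of all content identifiers
    After every `block_size` instances, `n_drift` contents are drawn at random
    and their clicks are set to 0 for the following block. The first block is
    left unchanged.
    """
    rng = random.Random(seed)
    drifted = []
    suppressed = set()
    for idx, (content_id, context, click) in enumerate(instances):
        if idx > 0 and idx % block_size == 0:
            suppressed = set(rng.sample(sorted(content_ids), n_drift))
        new_click = 0 if content_id in suppressed else click
        drifted.append((content_id, context, new_click))
    return drifted
```

With the values from the text (`block_size=10000`, `n_drift=20`), half of the 40 contents have their clicks suppressed in every block after the first.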

B. Learning Algorithms 

While DISCOM and DISCOM-W are the first distributed 
algorithms to perform content aggregation (see Table [J, we 
compare their performance with distributed versions of the 
centralized algorithms proposed in 0, fig, fig, In the 
distributed implementation of these centralized algorithms, we 
assume that each CA runs an independent instance of these 


algorithms. For instance, when implementing a centralized
algorithm A on the distributed system of CAs, we assume that
each CA i runs its own instance of A, denoted by $A_i$. When
CA i selects CA j as a matching action in $\mathcal{K}_i$ by using its
algorithm $A_i$, CA j will select the content for CA i using
its algorithm $A_j$ with CA i's user's context on the set of
contents $\mathcal{C}_j$. In our numerical results, each algorithm is run
for different values of its input parameters. The results are 
shown for the parameter values for which the corresponding 
algorithm performs the best. 

DISCOM: Our algorithm given in Fig. 4 with control
functions $H_1(t)$, $H_2(t)$ and $H_3(t)$ divided by 10 for MD, and
by 20 for YTM and YTMD, to reduce the number of trainings
and explorations.

DISCOM-W: Our algorithm given in Fig. 7, which is the
time-windowed version of DISCOM, with control functions
$H_1(t)$, $H_2(t)$ and $H_3(t)$ divided by 20 to reduce the number
of trainings and explorations.

As we mentioned in Remark 1, both DISCOM and
DISCOM-W can provide differential services to their users. In
this case both algorithms always exploit for the users with high
type (subscribers) and, if necessary, train and explore for the
users with low type (non-subscribers). Hence, the performance
of DISCOM and DISCOM-W for differential services is equal
to their performance for the set of high type users.

LinUCB 0, 1^ ; This algorithm computes an index for 
each matching action by assuming that the relevance score 
of a matching action for a user is a linear combination of the 
contexts of the user. Then for each user it selects the matching 
action with the highest index. 

Hybrid-e f^; This algorithm forms context-dependent 
sample mean rewards for the matching actions by considering 
the history of observations and decisions for groups of contexts 
that are similar to each other. For user t, it either explores a
random matching action with probability $\epsilon_t$ or exploits the best
matching action with probability $1 - \epsilon_t$, where $\epsilon_t$ is decreasing
in t.
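The explore-or-exploit rule of this baseline can be sketched as below. This is a generic decaying epsilon-greedy step, not the cited algorithm's exact implementation: the schedule $\epsilon_t = \min(1, c/t)$, the constant `c` and the function name are assumptions made here, and the grouping of similar contexts is omitted.

```python
import random


def hybrid_epsilon_choice(t, sample_means, c=1.0, rng=random):
    """Pick a matching action for user t with a decaying exploration rate.

    sample_means maps each matching action to its current sample mean reward
    for the context group of user t. The exploration probability
    epsilon_t = min(1, c / t) is an assumed schedule that decreases in t.
    Returns (action, "explore" or "exploit").
    """
    epsilon_t = min(1.0, c / t)
    actions = sorted(sample_means)
    if rng.random() < epsilon_t:
        return rng.choice(actions), "explore"
    best = max(sample_means.values())
    return rng.choice([a for a in actions if sample_means[a] == best]), "exploit"
```

Unlike DISCOM, the exploration decision here depends only on t and not on how many observations exist for the current context group, which is the main design difference between the two approaches.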

Contextual zooming (CZ) [20]: This algorithm adaptively
creates balls over the joint action and context space, calculates 
an index for each ball based on the history of selections of that 
ball, and at each time step selects a matching action according 
to the ball with the highest index that contains the current 
context. 


C. Yahoo! Today Module Simulations 

In YTM each instance (user) has two contexts $(x_1, x_2) \in [0,1]^2$.
We simulate the algorithms in Section VII-B for three
different context sets in which the learning algorithms only
decide based on (i) the first context $x_1$, (ii) the second context
$x_2$, and (iii) both contexts $(x_1, x_2)$ of the users. The $m_T$
parameter of DISCOM for these simulations is set to the
optimal value found in Theorem 1 (for $\gamma = 1$), which is
$\lceil T^{1/4} \rceil$ for simulations with a single context and
$\lceil T^{1/5} \rceil$ for simulations with both contexts.
DISCOM is run for numerous
$z$ values ranging from 1/4 to 1/2. Table IV compares the
performance of DISCOM, LinUCB, Hybrid-ε and CZ. All of
the algorithms are evaluated at the parameter values in which
they perform the best. As seen from the table, the CTR for
DISCOM with differential services is 16%, 5% and 7% higher
than the best of LinUCB, Hybrid-ε and CZ for contexts $x_1$,
$x_2$ and $(x_1, x_2)$, respectively.

*The number of trainings and explorations required in the regret bounds
are the worst-case numbers. In reality, good performance is achieved with a
much smaller number of trainings and explorations.

Context        DISCOM   DISCOM (diff. serv.)   LinUCB   Hybrid-ε   CZ
x_1            6.37     7.30                   6.31     5.92       4.29
x_2            6.14     6.45                   4.72     6.14       4.39
(x_1, x_2)     5.93     6.61                   5.65     6.15       4.24

TABLE IV

Comparison of the CTR $\times 10^2$ achieved by CA 1 for DISCOM and
other learning algorithms for YTM.

Table V compares the performance of DISCOM and the
percentages of training, exploration and exploitation phases for
different control functions (different $z$ parameters) for
simulations with context $x_2$. As expected, the percentages of
trainings and explorations increase with the control function.
As $z$ increases, matching actions are explored with a higher
accuracy, and hence the average exploitation reward (CTR)
increases.

z                                   1/4     1/3     1/2
CTR x 10^2                          5.13    5.29    6.14
CTR x 10^2 in exploitations         5.14    5.34    6.45
Exploit %                           98.9    97.7    90.2
Explore %                           0.5     0.6     1.9
Train %                             0.7     1.7     7.9

TABLE V

The CTR, training, exploration and exploitation percentages
of CA 1 using DISCOM with context $x_2$ for YTM.


D. Music Dataset Simulations

Table VI compares the performance of DISCOM, LinUCB,
Hybrid-ε and CZ for the music dataset. The parameter values
used for DISCOM for the result in Table VI are $z = 1/8$
and $m_T = 4$. From the results it is observed that DISCOM
achieves 10% improvement over LinUCB, 5% improvement
over Hybrid-ε, and 28% improvement over CZ in terms of
the average number of likes achieved for the users of CA 1.
Moreover, the average number of likes received by DISCOM
for the high type users (differential services) is even higher,
which is 13%, 8% and 32% higher than LinUCB, Hybrid-ε and CZ,
respectively.

Algorithm            DISCOM   DISCOM (diff. serv.)   LinUCB   Hybrid-ε   CZ
Avg. num. of likes   0.717    0.736                  0.652    0.683      0.559

TABLE VI

Comparison among DISCOM and other learning algorithms
for MD.


E. Yahoo! Today Module with Drift Simulations

Table VII compares the performance of DISCOM-W with
half window length ($\tau_h = 2500$) and $m_T = 10$, DISCOM
(with $m_T$ set equal to $\lceil T^{1/4} \rceil$ for simulations with a
single context dimension and $\lceil T^{1/5} \rceil$ for the simulation
with two context dimensions), LinUCB, Hybrid-ε and CZ. For the
results in the table, the $z$ parameter values of DISCOM and
DISCOM-W are set to the $z$ value at which they achieve the highest
number of clicks. Similarly, LinUCB, Hybrid-ε and CZ are
also evaluated at their best parameter values. The results show
the performance of DISCOM and DISCOM-W for differential
services. DISCOM-W performs the best in this dataset in terms
of the average number of clicks, with about 23%, 11.3% and
51.6% improvement over the best of LinUCB, Hybrid-ε and
CZ, for types of contexts $x_1$, $x_2$, and $(x_1, x_2)$, respectively.

Contexts used   DISCOM-W (diff. serv.)   DISCOM (diff. serv.)   LinUCB   Hybrid-ε   CZ
x_1             6.3                      5.5                    5.1      3.0        2.4
x_2             5.1                      4.2                    4.2      4.6        2.4
(x_1, x_2)      6.9                      3.8                    4.6      4.1        2.3

TABLE VII

The CTR $\times 10^2$ of DISCOM-W and DISCOM for differential
services, and the CTR of other learning algorithms for
YTMD.


VIII. Conclusion

In this paper we considered novel online learning
algorithms for content matching by a distributed set of CAs. We
have characterized the relation between the user and content
characteristics in terms of a relevance score, and proposed
online learning algorithms that learn to match each user
with the content with the highest relevance score. When the
user and content characteristics are static, the best matching
between content and each type of user can be learned perfectly,
i.e., the average regret due to suboptimal matching goes to
zero. When the user and content characteristics are dynamic,
depending on the rate of the change, an approximately optimal
matching between content and each user type can be learned.
In addition to our theoretical results, we have validated the
concept of distributed content matching on real-world datasets.
An interesting future research direction is to investigate the
interaction between different CAs when they compete for the
same pool of users. Should a CA send a content that has a
high chance of being liked by another CA's user to increase
its immediate reward, or should it send a content that has a
high chance of being disliked by the other CA's user to divert
that user from using that CA and make the user switch to it instead?

Appendix A

A Bound on Divergent Series

For $p > 0$, $p \neq 1$,

$$\sum_{t=1}^{T} t^{-p} \le 1 + \frac{T^{1-p} - 1}{1 - p}.$$

Proof: See p2). ■
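
The bound of Appendix A can be checked numerically. The short Python sketch below compares the partial sum of $t^{-p}$ with the right-hand side $1 + (T^{1-p} - 1)/(1-p)$ for a few values of $p$ and $T$; the function names and the chosen test values are assumptions made for illustration. The inequality itself follows from comparing the sum with the integral of $t^{-p}$ over $[1, T]$, since $t^{-p}$ is decreasing for $p > 0$.

```python
def partial_sum(T, p):
    """Sum of t^(-p) for t = 1, ..., T."""
    return sum(t ** (-p) for t in range(1, T + 1))


def appendix_a_bound(T, p):
    """Right-hand side 1 + (T^(1-p) - 1) / (1 - p), for p > 0 and p != 1."""
    return 1.0 + (T ** (1.0 - p) - 1.0) / (1.0 - p)


# The bound holds with equality at T = 1 and becomes loose as T grows.
for p in (0.25, 0.5, 2.0):
    for T in (1, 10, 1000):
        assert partial_sum(T, p) <= appendix_a_bound(T, p) + 1e-12
```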

Appendix B 
Proof of Theorem 1

A. Necessary Definitions and Notations 

Let $\beta_a := \sum_{t=1}^{\infty} 1/t^a$, and let $\log(\cdot)$ denote logarithm in
base $e$. For each hypercube $p \in \mathcal{P}_T$ let $\bar{\pi}_{c,p} := \sup_{x \in p} \pi_c(x)$,
$\underline{\pi}_{c,p} := \inf_{x \in p} \pi_c(x)$, for $c \in \mathcal{C}$, and $\bar{\mu}^i_{k,p} := \sup_{x \in p} \mu^i_k(x)$,
$\underline{\mu}^i_{k,p} := \inf_{x \in p} \mu^i_k(x)$, for $k \in \mathcal{K}_i$. Let $x^*_p$ be the context
at the center (center of symmetry) of the hypercube $p$. We
define the optimal matching action of CA $i$ for hypercube
$p$ as $k^*_i(p) := \arg\max_{k \in \mathcal{K}_i} \mu^i_k(x^*_p)$. When the hypercube $p$
is clear from the context, we will simply denote the optimal
matching action for hypercube $p$ with $k^*_i$. Let

$$\mathcal{L}^i_p(t) := \left\{ k \in \mathcal{K}_i : \underline{\mu}^i_{k^*_i,p} - \bar{\mu}^i_{k,p} > (4Ld^{\gamma/2} + 6)\, t^{-z/2} \right\}$$

be the set of suboptimal matching actions of CA $i$ at time $t$
in hypercube $p$. Also related to this let

$$\mathcal{C}^j_p(t) := \left\{ c \in \mathcal{C}_j : \underline{\pi}_{c^*_j,p} - \bar{\pi}_{c,p} > (4Ld^{\gamma/2} + 6)\, t^{-z/2} \right\}$$

be the set of suboptimal contents of CA $j$ at time $t$ in
hypercube $p$, where $c^*_j(p) = \arg\max_{c \in \mathcal{C}_j} \pi_c(x^*_p)$. Also when
the hypercube $p$ is clear from the context we will just use $c^*_j$.
The contents in $\mathcal{C}^j_p(t)$ are the ones that CA $j$ should not select
when called by another CA. The regret given in (3) can be
written as a sum of three components:

$$R_i(T) = \mathrm{E}[R^e_i(T)] + \mathrm{E}[R^s_i(T)] + \mathrm{E}[R^n_i(T)],$$

where $R^e_i(T)$ is the regret due to trainings and explorations by
time $T$, $R^s_i(T)$ is the regret due to suboptimal matching action
selections in exploitations by time $T$ and $R^n_i(T)$ is the regret
due to near optimal matching action selections in exploitations
by time $T$, which are all random variables.

B. Bounding the Regret in Training, Exploration and Exploitation Phases

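In the following lemmas we will bound each of these terms
separately. The following lemma bounds $\mathrm{E}[R^e_i(T)]$.

Lemma 1: Consider all CAs running DISCOM with parameters
$H_1(t) = t^z \log t$, $H_2(t) = C_{\max} t^z \log t$, $H_3(t) = t^z \log t$
and $m_T = \lceil T^{\kappa} \rceil$, where $0 < z < 1$ and $0 < \kappa < 1/d$. Then,
we have

$$\mathrm{E}[R^e_i(T)] \le 2^{d+1} \left( |\mathcal{C}_i| + (M-1)(C_{\max} + 1) \right) T^{z + \kappa d} \log T + 2^{d+1} \left( |\mathcal{C}_i| + 2(M-1) \right) T^{\kappa d}.$$

Proof: Time slot $t$ is a training or an exploration slot for
CA $i$ only when one of its counters for the hypercube $p_i(t)$ is
below the corresponding control function. Hence, for each
hypercube in $\mathcal{P}_T$, up to time $T$ there can be at most
$\lceil T^z \log T \rceil$ exploration slots in which a content
$c \in \mathcal{C}_i$ is matched with the user of CA $i$,
$\lceil C_{\max} T^z \log T \rceil$ training slots in which CA $i$ selects CA
$j \in \mathcal{M}_{-i}$, and $\lceil T^z \log T \rceil$ exploration slots in which
CA $i$ selects CA $j \in \mathcal{M}_{-i}$. The result follows from summing these
terms over the contents, the other CAs and the $(m_T)^d$ hypercubes,
and from the fact that $(m_T)^d \le 2^d T^{\kappa d}$ for any $T \ge 1$. The additional
factor of 2 comes from the fact that the realized regret at any
time slot can be at most 2. ■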
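
The counting argument in the proof of Lemma 1 can be written out as a small Python sketch. It assumes $m_T = \lceil T^{\kappa} \rceil$ and control functions of the form $t^z \log t$ (natural logarithm); the function and parameter names (`num_contents` for $|\mathcal{C}_i|$, `num_cas` for $M$, `c_max` for $C_{\max}$) are hypothetical and chosen only for this illustration.

```python
import math


def training_exploration_slots(T, z, kappa, d, num_contents, num_cas, c_max):
    """Worst-case number of training and exploration slots by time T."""
    m_T = math.ceil(T ** kappa)          # slices per context dimension
    hypercubes = m_T ** d                # size of the context partition
    per_content = math.ceil(T ** z * math.log(T))
    per_ca_training = math.ceil(c_max * T ** z * math.log(T))
    per_ca_exploration = math.ceil(T ** z * math.log(T))
    others = num_cas - 1
    per_hypercube = (num_contents * per_content
                     + others * (per_ca_training + per_ca_exploration))
    return hypercubes * per_hypercube


def lemma1_bound(T, z, kappa, d, num_contents, num_cas, c_max):
    """Closed-form bound; the factor 2 is the largest one-slot regret."""
    others = num_cas - 1
    lead = ((num_contents + others * (c_max + 1))
            * T ** (z + kappa * d) * math.log(T))
    rest = (num_contents + 2 * others) * T ** (kappa * d)
    return 2 ** (d + 1) * (lead + rest)
```

Multiplying the slot count by the maximum one-slot regret of 2 stays below the closed-form expression, because each ceiling adds at most one slot and $\lceil T^{\kappa} \rceil \le 2T^{\kappa}$ for $T \ge 1$.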

For any $k \in \mathcal{K}_i$ and $p \in \mathcal{P}_T$, the sample mean $\bar{r}^i_{k,p}(t)$ of
the relevance score of matching action $k$ represents a random
variable which is the average of the independent samples in set
$\mathcal{E}^i_{k,p}(t)$. Since these samples are not identically distributed, in
order to facilitate our analysis of the regret, we generate two
different artificial i.i.d. processes to bound the probabilities
related to $\hat{\mu}^i_{k,p}(t) = \bar{r}^i_{k,p}(t) - d^i_k$, $k \in \mathcal{K}_i$. The first one is the
best process for CA $i$ in which the net reward of the matching
action $k$ for a user with context in $p$ is sampled from an i.i.d.
Bernoulli process with mean $\bar{\mu}^i_{k,p}$; the other one is the worst
process for CA $i$ in which this net reward is sampled from
an i.i.d. Bernoulli process with mean $\underline{\mu}^i_{k,p}$. Let $\hat{\mu}^{b,i}_{k,p}(z)$ denote
the sample mean of the $z$ samples from the best process and
$\hat{\mu}^{w,i}_{k,p}(z)$ denote the sample mean of the $z$ samples from the
worst process for CA $i$. We will bound the terms $\mathrm{E}[R^n_i(T)]$
and $\mathrm{E}[R^s_i(T)]$ by using these artificial processes along with
the similarity information given in Assumption 1.

Let $\Xi^i_{j,p}(t)$ be the event that a suboptimal content $c \in \mathcal{C}_j$
is selected by CA $j \in \mathcal{M}_{-i}$, when it is called by CA $i$ for
a context in set $p$ for the $t$th time in the exploitation phases
of CA $i$. Let $X^i_{j,p}(t)$ denote the random variable which is
the number of times CA $j$ selects a suboptimal content when
called by CA $i$ in exploitation slots of CA $i$ when the context
is in set $p \in \mathcal{P}_T$ by time $t$. Clearly, we have

$$X^i_{j,p}(t) = \sum_{t'=1}^{|\mathcal{E}^i_{j,p}(t)|} I\left( \Xi^i_{j,p}(t') \right),$$

where $I(\cdot)$ is the indicator function which is equal to 1 if the
event inside is true and 0 otherwise. The following lemma
bounds $\mathrm{E}[R^s_i(T)]$.

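Lemma 2: Consider all CAs running DISCOM with parameters
$H_1(t) = t^z \log t$, $H_2(t) = C_{\max} t^z \log t$, $H_3(t) = t^z \log t$
and $m_T = \lceil T^{\kappa} \rceil$, where $0 < z < 1$ and $\kappa = z/(2\gamma)$. Then,
we have

$$\mathrm{E}[R^s_i(T)] \le 4(|\mathcal{C}_i| + M)\beta_2 + 4(|\mathcal{C}_i| + M) M C_{\max} \beta_2 \frac{T^{1 - z/2}}{1 - z/2}.$$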

Proof: Consider time t. For simplicity of notation let p = 
Pi{t). Let 

:= {Mtpp,){t) U A4-= 0} 

be the event that CA i exploits at time t. 

First, we will bound the probability that CA $i$ selects a
suboptimal matching action in an exploitation slot. Then, using
this we will bound the expected number of times a suboptimal
matching action is selected by CA $i$ in exploitation slots. Note
that every time a suboptimal matching action is selected by
CA $i$, since $\mu^i_k(x) = \pi_k(x) - d^i_k \in [-1, 1]$ for all $k \in \mathcal{K}_i$,
the realized (hence expected) loss is bounded above by 2.
Therefore, 2 times the expected number of times a suboptimal
matching action is chosen in an exploitation slot bounds the
regret due to suboptimal matching actions in exploitation slots.
Let $\mathcal{V}^i_k(t)$ be the event that matching action $k \in \mathcal{K}_i$ is chosen
at time $t$ by CA $i$. We have

$$R^s_i(T) \le 2 \sum_{t=1}^{T} \sum_{k \in \mathcal{L}^i_p(t)} I\left( \mathcal{V}^i_k(t), \mathcal{W}^i(t) \right).$$

Taking the expectation

$$\mathrm{E}[R^s_i(T)] \le 2 \sum_{t=1}^{T} \sum_{k \in \mathcal{L}^i_p(t)} P\left( \mathcal{V}^i_k(t), \mathcal{W}^i(t) \right). \tag{10}$$

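Let $\mathcal{B}^i_{j,p}(t)$ be the event that at most $t^{\phi}$ samples in $\mathcal{E}^i_{j,p}(t)$
are collected from suboptimal content of CA $j$. Let
$\mathcal{B}^i(t) := \bigcap_{j \in \mathcal{M}_{-i}} \mathcal{B}^i_{j,p}(t)$, and for a set $A$ let $A^c$ denote the complement
of that set. For any $k \in \mathcal{K}_i$, we have

$$\begin{aligned}
P\left( \mathcal{V}^i_k(t), \mathcal{W}^i(t) \right)
&\le P\left( \hat{\mu}^i_{k,p}(t) \ge \bar{\mu}^i_{k,p} + H_t, \mathcal{W}^i(t), \mathcal{B}^i(t) \right) \\
&\quad + P\left( \hat{\mu}^i_{k^*_i,p}(t) \le \underline{\mu}^i_{k^*_i,p} - H_t, \mathcal{W}^i(t), \mathcal{B}^i(t) \right) + P\left( \mathcal{B}^i(t)^c, \mathcal{W}^i(t) \right) \\
&\quad + P\left( \hat{\mu}^i_{k,p}(t) \ge \hat{\mu}^i_{k^*_i,p}(t),\ \hat{\mu}^i_{k,p}(t) < \bar{\mu}^i_{k,p} + H_t,\ \hat{\mu}^i_{k^*_i,p}(t) > \underline{\mu}^i_{k^*_i,p} - H_t,\ \mathcal{W}^i(t), \mathcal{B}^i(t) \right)
\end{aligned} \tag{11}$$

for some $H_t > 0$. We have for any suboptimal matching action
$k \in \mathcal{L}^i_p(t)$,

$$\begin{aligned}
&P\left( \hat{\mu}^i_{k,p}(t) \ge \hat{\mu}^i_{k^*_i,p}(t),\ \hat{\mu}^i_{k,p}(t) < \bar{\mu}^i_{k,p} + H_t,\ \hat{\mu}^i_{k^*_i,p}(t) > \underline{\mu}^i_{k^*_i,p} - H_t,\ \mathcal{W}^i(t), \mathcal{B}^i(t) \right) \\
&\le P\Big( \hat{\mu}^{b,i}_{k,p}(|\mathcal{E}^i_{k,p}(t)|) \ge \hat{\mu}^{w,i}_{k^*_i,p}(|\mathcal{E}^i_{k^*_i,p}(t)|) - t^{-\phi}, \\
&\qquad \hat{\mu}^{b,i}_{k,p}(|\mathcal{E}^i_{k,p}(t)|) < \bar{\mu}^i_{k,p} + L\left( \sqrt{d}/m_T \right)^{\gamma} + H_t + t^{-\phi}, \\
&\qquad \hat{\mu}^{w,i}_{k^*_i,p}(|\mathcal{E}^i_{k^*_i,p}(t)|) > \underline{\mu}^i_{k^*_i,p} - L\left( \sqrt{d}/m_T \right)^{\gamma} - H_t,\ \mathcal{W}^i(t) \Big).
\end{aligned}$$

For $k \in \mathcal{L}^i_p(t)$, when

$$2L\left( \sqrt{d}/m_T \right)^{\gamma} + 2H_t + 2t^{-\phi} \le (4Ld^{\gamma/2} + 6)\, t^{-z/2} \tag{12}$$

the three inequalities given below

$$\begin{aligned}
\underline{\mu}^i_{k^*_i,p} - \bar{\mu}^i_{k,p} &> (4Ld^{\gamma/2} + 6)\, t^{-z/2}, \\
\hat{\mu}^{b,i}_{k,p}(|\mathcal{E}^i_{k,p}(t)|) &< \bar{\mu}^i_{k,p} + L\left( \sqrt{d}/m_T \right)^{\gamma} + H_t + t^{-\phi}, \\
\hat{\mu}^{w,i}_{k^*_i,p}(|\mathcal{E}^i_{k^*_i,p}(t)|) &> \underline{\mu}^i_{k^*_i,p} - L\left( \sqrt{d}/m_T \right)^{\gamma} - H_t
\end{aligned}$$

together imply that $\hat{\mu}^{b,i}_{k,p}(|\mathcal{E}^i_{k,p}(t)|) < \hat{\mu}^{w,i}_{k^*_i,p}(|\mathcal{E}^i_{k^*_i,p}(t)|) - t^{-\phi}$,
which implies that

$$P\left( \hat{\mu}^i_{k,p}(t) \ge \hat{\mu}^i_{k^*_i,p}(t),\ \hat{\mu}^i_{k,p}(t) < \bar{\mu}^i_{k,p} + H_t,\ \hat{\mu}^i_{k^*_i,p}(t) > \underline{\mu}^i_{k^*_i,p} - H_t,\ \mathcal{W}^i(t), \mathcal{B}^i(t) \right) = 0. \tag{13}$$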

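Let $H_t = 2t^{-\phi} + Ld^{\gamma/2} m_T^{-\gamma}$. A sufficient condition that
implies (12) is

$$4Ld^{\gamma/2}\, t^{-\kappa\gamma} + 6t^{-\phi} \le (4Ld^{\gamma/2} + 6)\, t^{-z/2} \tag{14}$$

which holds for all $t \ge 1$ when $\phi = z/2$ and $\kappa\gamma \ge z/2$. Using
a Chernoff-Hoeffding bound, for any $k \in \mathcal{L}^i_p(t)$, since on
the event $\mathcal{W}^i(t)$, $|\mathcal{E}^i_{k,p}(t)| \ge t^z \log t$, we have

$$P\left( \hat{\mu}^i_{k,p}(t) \ge \bar{\mu}^i_{k,p} + H_t, \mathcal{W}^i(t), \mathcal{B}^i(t) \right) \le t^{-2} \tag{15}$$

and

$$P\left( \hat{\mu}^i_{k^*_i,p}(t) \le \underline{\mu}^i_{k^*_i,p} - H_t, \mathcal{W}^i(t), \mathcal{B}^i(t) \right) \le t^{-2}. \tag{16}$$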


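Since $\{ \mathcal{B}^i_{j,p}(t)^c, \mathcal{W}^i(t) \} = \{ X^i_{j,p}(t) \ge t^{\phi} \}$, by applying the
Markov inequality, we have

$$P\left( \mathcal{B}^i_{j,p}(t)^c, \mathcal{W}^i(t) \right) \le \mathrm{E}[X^i_{j,p}(t)]\, t^{-\phi}.$$

Since

$$X^i_{j,p}(t) = \sum_{t'=1}^{|\mathcal{E}^i_{j,p}(t)|} I\left( \Xi^i_{j,p}(t') \right)$$

and

$$\begin{aligned}
P\left( \Xi^i_{j,p}(t) \right)
&\le \sum_{m \in \mathcal{C}^j_p(t)} P\left( \bar{r}^j_{m,p}(t) \ge \bar{r}^j_{c^*_j,p}(t) \right) \\
&\le \sum_{m \in \mathcal{C}^j_p(t)} \Big( P\left( \bar{r}^j_{m,p}(t) \ge \bar{\pi}_{m,p} + H_t, \mathcal{W}^i(t) \right)
 + P\left( \bar{r}^j_{c^*_j,p}(t) \le \underline{\pi}_{c^*_j,p} - H_t, \mathcal{W}^i(t) \right) \\
&\qquad + P\left( \bar{r}^j_{m,p}(t) \ge \bar{r}^j_{c^*_j,p}(t),\ \bar{r}^j_{m,p}(t) < \bar{\pi}_{m,p} + H_t,\ \bar{r}^j_{c^*_j,p}(t) > \underline{\pi}_{c^*_j,p} - H_t,\ \mathcal{W}^i(t) \right) \Big).
\end{aligned}$$

When (14) holds, the last probability in the sum above is equal
to zero while the first two probabilities are upper bounded by
$e^{-2(H_t)^2 t^z \log t}$. Therefore, we have

$$P\left( \Xi^i_{j,p}(t) \right) \le \sum_{m \in \mathcal{C}^j_p(t)} 2e^{-2(H_t)^2 t^z \log t} \le 2|\mathcal{C}_j|\, t^{-2}.$$

This implies that

$$\mathrm{E}[X^i_{j,p}(t)] \le \sum_{t'=1}^{\infty} P\left( \Xi^i_{j,p}(t') \right) \le 2|\mathcal{C}_j| \sum_{t'=1}^{\infty} t'^{-2}.$$

Therefore, by the Markov inequality and union bound we get

$$P\left( \mathcal{B}^i_{j,p}(t)^c, \mathcal{W}^i(t) \right) = P\left( X^i_{j,p}(t) \ge t^{\phi} \right) \le 2|\mathcal{C}_j| \beta_2\, t^{-z/2}$$

and

$$P\left( \mathcal{B}^i(t)^c, \mathcal{W}^i(t) \right) \le 2MC_{\max}\beta_2\, t^{-z/2}. \tag{17}$$

Then, using (13), (15), (16) and (17), we have

$$P\left( \mathcal{V}^i_k(t), \mathcal{W}^i(t) \right) \le 2t^{-2} + 2MC_{\max}\beta_2\, t^{-z/2}$$

for any $k \in \mathcal{L}^i_p(t)$. By (10), and by the result of
Appendix A, we get the stated bound for $\mathrm{E}[R^s_i(T)]$. ■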

The next lemma bounds $\mathrm{E}[R^n_i(T)]$.

Lemma 3: Consider all CAs running DISCOM with parameters
$H_1(t) = t^z \log t$, $H_2(t) = C_{\max} t^z \log t$, $H_3(t) = t^z \log t$
and $m_T = \lceil T^{\kappa} \rceil$, where $0 < z < 1$ and $\kappa = z/(2\gamma)$. Then,
we have

$$\mathrm{E}[R^n_i(T)] \le \frac{14Ld^{\gamma/2} + 12}{1 - z/2}\, T^{1 - z/2} + 4C_{\max}\beta_2.$$

Proof: At any time $t$, for any $k \in \mathcal{K}_i - \mathcal{L}^i_p(t)$ and $x \in p$,
we have

$$\mu^i_{k^*_i(x)}(x) - \mu^i_k(x) \le (7Ld^{\gamma/2} + 6)\, t^{-z/2}.$$

Similarly, for any $j \in \mathcal{M}$, $c \in \mathcal{C}_j - \mathcal{C}^j_p(t)$ and $x \in p$, we have

$$\pi_{c^*_j(x)}(x) - \pi_c(x) \le (7Ld^{\gamma/2} + 6)\, t^{-z/2}.$$

Due to the above inequalities, if a near optimal action in
$\mathcal{C}_i \cap (\mathcal{K}_i - \mathcal{L}^i_p(t))$ is chosen by CA $i$ at time $t$, the contribution
to the regret is at most $(7Ld^{\gamma/2} + 6)\, t^{-z/2}$. If a near optimal
CA $j \in \mathcal{M}_{-i} \cap (\mathcal{K}_i - \mathcal{L}^i_p(t))$ is called by CA $i$ at time $t$, and if
CA $j$ selects one of its near optimal contents in $\mathcal{C}_j - \mathcal{C}^j_p(t)$, then
the contribution to the regret is at most $2(7Ld^{\gamma/2} + 6)\, t^{-z/2}$.
Moreover, since we are in an exploitation step, the near-optimal
CA $j$ that is chosen can choose one of its suboptimal contents
in $\mathcal{C}^j_p(t)$ with probability at most $2C_{\max} t^{-2}$, which will result
in an expected regret of at most $4C_{\max} t^{-2}$.

Therefore, the total regret due to near optimal choices of
CA $i$ by time $T$ is upper bounded by

$$(14Ld^{\gamma/2} + 12) \sum_{t=1}^{T} t^{-z/2} + 4C_{\max} \sum_{t=1}^{T} t^{-2}
\le \frac{14Ld^{\gamma/2} + 12}{1 - z/2}\, T^{1 - z/2} + 4C_{\max}\beta_2$$

by using the result in Appendix A. ■

Next, we give the proof of Theorem 1 by combining the
results of the above lemmas.


C. Proof of Theorem 1 

The highest orders of regret that come from Lemmas 1,
2 and 3 are $T^{\kappa d + z} \log T$, $T^{1 - z/2}$ and $T^{1 - z/2}$, respectively. We need
to optimize them with respect to the constraint $\kappa = z/(2\gamma)$. Regret is
minimized when $\kappa d + z = 1 - z/2$, which is attained by
$z = 2\gamma/(3\gamma + d)$. The result follows from summing the bounds
in Lemmas 1, 2 and 3.
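
The balancing step in this proof can be verified with exact rational arithmetic. The sketch below assumes the constraint $\kappa = z/(2\gamma)$ from Lemmas 2 and 3 and checks that $z = 2\gamma/(3\gamma + d)$ equates the two exponents $\kappa d + z$ and $1 - z/2$; the function name `optimal_exponents` is hypothetical.

```python
from fractions import Fraction


def optimal_exponents(gamma, d):
    """Return (z, kappa, regret exponent) balancing kappa*d + z = 1 - z/2."""
    gamma, d = Fraction(gamma), Fraction(d)
    z = 2 * gamma / (3 * gamma + d)
    kappa = z / (2 * gamma)
    # Both regret exponents coincide at this choice of z.
    assert kappa * d + z == 1 - z / 2
    return z, kappa, 1 - z / 2
```

For $\gamma = 1$ and $d = 1$ this gives $z = 1/2$, $\kappa = 1/4$ and a regret exponent of $3/4$; for $d = 2$ it gives $z = 2/5$, $\kappa = 1/5$ and an exponent of $4/5$.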


Appendix C 

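Proof of Corollary 1

From the proof of Lemma 2, for $z = 2\gamma/(3\gamma + d)$, we have

$$P\left( \mathcal{V}^i_k(t), \mathcal{W}^i(t) \right) \le 2t^{-2} + 2MC_{\max}\beta_2\, t^{-\gamma/(3\gamma + d)}$$

for $k \in \mathcal{L}^i_p(t)$. This implies that

$$P\left( a_i(t) \in \mathcal{L}^i_p(t), \mathcal{W}^i(t) \right) \le \frac{2|\mathcal{K}_i|}{t^2} + \frac{2|\mathcal{K}_i| M C_{\max} \beta_2}{t^{\gamma/(3\gamma + d)}}.$$

The difference between the expected reward of an action
within a hypercube from its expected reward at the center
of the hypercube is at most $Ld^{\gamma/2}/(m_T)^{\gamma}$. Since $m_T = \lceil T^{1/(3\gamma + d)} \rceil$,
$a_i(t) \in \mathcal{K}_i - \mathcal{L}^i_p(t)$ implies that

$$\mu^i_{a_i(t)}(x_i(t)) \ge \mu^i_{k^*_i(x_i(t))}(x_i(t)) - (6Ld^{\gamma/2} + 6)\, t^{-\gamma/(3\gamma + d)}.$$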


Appendix D 
Proof of Theorem 2

In order for time $t$ to be an exploitation slot for CA $i$ it is
required that no content of CA $i$ is under-explored and no other
CA is under-trained or under-explored for the hypercube that
contains the current context. Since
the counters of DISCOM are updated only when feedback
is received, and since the control functions are the same
as the ones that are used in the setting where feedback
is always available, the regret due to suboptimal and near
optimal matching actions by time $t$ with missing feedback
will not be any greater than the regret due to suboptimal
and near optimal matching actions for the case when the
users always provide feedback. Therefore, the bounds given
in Lemmas 2 and 3 will also hold for the case with missing
feedback. Only the regret due to trainings and explorations
increases, since more trainings and explorations are needed
before the counters exceed the values of the control functions
such that the relevance score estimates are accurate enough
to exploit. Consider any $p \in \mathcal{P}_T$. From the definition of
DISCOM, the number of exploration slots in which content
$c \in \mathcal{C}_i$ is matched with CA $i$'s user and the user's feedback
is observed is at most $\lceil T^{2\gamma/(3\gamma + d)} \log T \rceil$. The number of training
slots in which CA $i$ requested content from CA $j \in \mathcal{M}_{-i}$
and received the feedback about this content from its user is
at most $\lceil C_{\max} T^{2\gamma/(3\gamma + d)} \log T \rceil$. The number of exploration
slots in which CA $i$ selected CA $j \in \mathcal{M}_{-i}$ is at most
$\lceil T^{2\gamma/(3\gamma + d)} \log T \rceil$.

Let $\tau_{\exp}(T)$ be the random variable which denotes the
smallest time step for which for each $c \in \mathcal{C}_i$ there are
$\lceil T^{2\gamma/(3\gamma + d)} \log T \rceil$ feedback observations, for each $j \in \mathcal{M}_{-i}$ there
are $\lceil C_{\max} T^{2\gamma/(3\gamma + d)} \log T \rceil$ feedback observations for the
trainings and $\lceil T^{2\gamma/(3\gamma + d)} \log T \rceil$ feedback observations for
the explorations. Then, $\mathrm{E}[\tau_{\exp}(T)]$ is the expected number
of training plus exploration slots by time $T$. Let $Y_{\exp}(t)$
be the random variable which denotes the number of time
slots in which the feedback is not provided by the users
of CA $i$ until CA $i$ has received $t$ feedbacks from its users. Let
$A_i(T) = Z_i T^{2\gamma/(3\gamma + d)} \log T + (|\mathcal{C}_i| + 2(M - 1))$. We have

$$\mathrm{E}[\tau_{\exp}(T)] = \mathrm{E}[Y_{\exp}(A_i(T))] + A_i(T).$$

$Y_{\exp}(A_i(T))$ is a negative binomial random variable with
probability of observing no feedback at any time $t$ equal to
$1 - p_r$. Therefore,

$$\mathrm{E}[Y_{\exp}(A_i(T))] = (1 - p_r) A_i(T) / p_r.$$

Using this, we get

$$\mathrm{E}[\tau_{\exp}(T)] = A_i(T) / p_r.$$

The regret bound follows from substituting this into the proof
of Theorem 1.
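
The expectation used in this proof can be illustrated with a short simulation. The Python sketch below assumes that each slot yields feedback independently with probability `p_r` and counts the slots needed to collect a given number of feedbacks; the function names, the seed and the numerical values are assumptions chosen for the example, not values from the paper.

```python
import random


def expected_slots(num_feedbacks, p_r):
    """E[tau] = A / p_r: expected slots needed to collect A feedbacks."""
    return num_feedbacks / p_r


def simulate_slots(num_feedbacks, p_r, rng):
    """Count slots until `num_feedbacks` feedbacks have been observed."""
    slots = 0
    received = 0
    while received < num_feedbacks:
        slots += 1
        if rng.random() < p_r:   # the user provides feedback in this slot
            received += 1
    return slots


rng = random.Random(1)
runs = [simulate_slots(200, 0.4, rng) for _ in range(2000)]
average = sum(runs) / len(runs)
```

The empirical average stays close to `expected_slots(200, 0.4)`, which is the $A_i(T)/p_r$ scaling of the training and exploration term: the fewer users give feedback, the longer the learning phases last.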


Appendix E 
Proof of Theorem 3

The basic idea is to choose $\tau_h$ in a way that the regret
due to variation of relevance scores over time and the regret
due to variation of estimated relevance scores due to the
limited number of observations during each round are balanced.
The majority of the steps of this proof are similar to the proof of
Theorem 1; hence some of the steps are omitted.

Consider a round $\eta$ of length $2\tau_h$. Denote the set of time
slots in round $\eta$ by $[\eta]$. For any $c \in \mathcal{C}$ let

$$\bar{\pi}_{c,p,\eta} := \sup_{x \in p,\, t \in [\eta]} \pi_{c,t}(x), \qquad
\underline{\pi}_{c,p,\eta} := \inf_{x \in p,\, t \in [\eta]} \pi_{c,t}(x).$$

For any $k \in \mathcal{K}_i$ let

$$\bar{\mu}^i_{k,p,\eta} := \sup_{x \in p,\, t \in [\eta]} \mu^i_{k,t}(x), \qquad
\underline{\mu}^i_{k,p,\eta} := \inf_{x \in p,\, t \in [\eta]} \mu^i_{k,t}(x),$$

where $\mu^i_{k,t}(x)$ is defined as the time-varying version of $\mu^i_k(x)$
given in (1) under Assumption 2. For CA $i$, the set of
suboptimal matching actions is given as

'■= e : tLkf{p),p,7j ~ ^k,p,r, 

> {4L(P/'^ + 6 )(< mod th + ’ 

where $k^*_i(p)$ is the matching action with the highest net reward
for the context at the center of $p$ at the time slot in the middle
of round $\eta$.

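Consider the regret due to explorations and trainings for
DISCOM-W incurred over times when it is in the active
sub-phase (over $\tau_h$ time slots). Similar to the proof of Lemma 1 it
can be shown that the regret due to trainings and explorations
is

$$\mathrm{E}[R^e_i(\tau_h)] = \tilde{O}\left( \tau_h^{z + \kappa d} \right).$$

Similar to the proof of Lemma 2, it can be shown that the
regret due to suboptimal matching action selections is

$$\mathrm{E}[R^s_i(\tau_h)] = O\left( \tau_h^{1 - z/2} \right)$$

when $\kappa = z/(2\gamma)$. Since the definition of a sub-optimal
matching action is different for dynamic user and content
characteristics, the regret due to near optimal matching actions
in $\mathcal{K}_i - \mathcal{L}^i_{p,\eta}(t)$ is different from Lemma 3. At time $t$ which is
in round $\eta$, since a near optimal matching action's contribution
to the one-step regret is at most

$$(14Ld^{\gamma/2} + 12)(t \bmod \tau_h + 1)^{-z/2} + 4\tau_h/T_s,$$

summing over all time slots in a round $\eta$, we have

$$\mathrm{E}[R^n_i(\tau_h)] = O\left( \tau_h^{1 - z/2} \right) + O\left( \frac{\tau_h^2}{T_s} \right).$$

Clearly we have $\mathrm{E}[R^s_i(\tau_h)] \le \mathrm{E}[R^n_i(\tau_h)]$. Let $\tau_h = \lfloor T_s^{\phi} \rfloor$ for
some $\phi > 0$. Then we have

$$\frac{\mathrm{E}[R^e_i(\tau_h)]}{\tau_h} = \tilde{O}\left( T_s^{\phi(z + \kappa d - 1)} \right)$$

and

$$\frac{\mathrm{E}[R^n_i(\tau_h)]}{\tau_h} = O\left( T_s^{-\phi z/2} \right) + O\left( T_s^{\phi - 1} \right).$$

The sum $(\mathrm{E}[R^e_i(\tau_h)] + \mathrm{E}[R^s_i(\tau_h)] + \mathrm{E}[R^n_i(\tau_h)])/\tau_h$ is
minimized by setting $z = 2\gamma/(3\gamma + d)$ and $\phi = 1/(1 + z/2)$.
Hence, $\tau_h = \lfloor T_s^{(3\gamma + d)/(4\gamma + d)} \rfloor$ and the order of the time averaged regret is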
equal to O f 
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